

Greediness of Regular Expressions

In the previous example, the tagged expression explicitly consisted of non-quote characters. Since the regular expression must be terminated by a quote character, would it be possible to specify the tagged pattern as simply one or more of any character, that is, to use the regular expression .+ in its place? If there are no other quote symbols on the line being matched, such a regular expression works exactly like the one which explicitly requires non-quote characters. But if there are any other quotes on the same line, the tagged expression keeps "eating" characters until it reaches the very last quote on the line. For this reason, the default behavior of regular expressions is said to be greedy: each * or + matches as much text as it can while still allowing the rest of the regular expression to match. To make the issue more concrete, consider the following:
$str = '<img src = "/one.jpg"> <br> <img src = "/two.jpg">'; 
$str =~  /< *img +src *= *["'](.+)["']/i;
$imagename = $1;
print $imagename;  # prints /one.jpg"> <br> <img src = "/two.jpg
There are two solutions to this problem. The first, illustrated in the previous section, is to use a negated character class to prevent the tagged expression from matching past the first quote. As soon as a quote character is encountered, the tagged expression stops consuming characters, the closing quote in the pattern matches, and only the string inside the quotes is returned. The second solution is to modify the behavior of the regular expression matching algorithm so that it becomes non-greedy, and matches the shortest piece of text that still allows the whole regular expression to match. This is done by following the * or + regular expression modifiers with a question mark (?). Thus the following simple change to the previous code fragment results in $imagename containing just the first image name, and not the additional text up to the last quote character.
     $str =~  /< *img +src *= *["'](.+?)["']/i;
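To compare the approaches side by side, the following sketch runs all three patterns against the same string. The negated character class [^"']+ is an assumption about the form used in the previous section, and the variable names $greedy, $negated and $lazy are chosen only for illustration.

```perl
use strict;
use warnings;

my $str = '<img src = "/one.jpg"> <br> <img src = "/two.jpg">';

# Greedy: .+ runs to the last quote on the line.
my ($greedy) = $str =~ /< *img +src *= *["'](.+)["']/i;

# Negated character class: the tagged text cannot contain a quote.
my ($negated) = $str =~ /< *img +src *= *["']([^"']+)["']/i;

# Non-greedy: .+? stops at the first quote that lets the match succeed.
my ($lazy) = $str =~ /< *img +src *= *["'](.+?)["']/i;

print "greedy:  $greedy\n";   # /one.jpg"> <br> <img src = "/two.jpg
print "negated: $negated\n";  # /one.jpg
print "lazy:    $lazy\n";     # /one.jpg
```

In list context the match returns the tagged text directly, so each result can be assigned without going through $1.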
Whether to be more specific about the nature of the characters to be matched or to simply make the + or * modifiers non-greedy is mostly a matter of choice. Be advised, however, that not all programs which work with regular expressions can make these operators non-greedy; the POSIX regular expressions used by traditional versions of grep, sed and awk, for example, have no non-greedy form.
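The two approaches give the same answer for the image example, but they are not identical in general. A non-greedy pattern will still run past a quote when that is the only way for the rest of the regular expression to match, whereas a negated character class can never include one. The sketch below illustrates this with a made-up string; $text and the trailing " now" in the patterns are hypothetical and serve only to force the difference.

```perl
use strict;
use warnings;

# Hypothetical input: two quoted strings, with extra text after the second.
my $text = 'say "hi" and "bye" now';

# Non-greedy: the match starts at the first quote, and .+? grows until
# the rest of the pattern (a quote followed by " now") can match.
my ($lazy) = $text =~ /"(.+?)" now/;

# Negated class: the tagged text can never contain a quote, so the
# match has to begin at a later quote instead.
my ($negated) = $text =~ /"([^"]+)" now/;

print "lazy:    $lazy\n";     # hi" and "bye
print "negated: $negated\n";  # bye
```

When the tagged text must never contain the delimiter, the negated character class states that requirement directly, and it also works in programs that lack non-greedy operators.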


Phil Spector 2002-10-18