Greediness of Regular Expressions

Suppose that we try to use the regular expression for image names developed previously on a string containing more than one image name:

>>> newtext = '<img src = "/one.jpg"> <br> <img src = "/two.jpg">' 
>>> imgpat.findall(newtext)
['/one.jpg"> <br> <img src = "/two.jpg']

Instead of the expected two image names, we get a single match: the first image name plus all the text that follows it, through the end of the second image name. The problem lies in the behavior of the plus sign (+) modifier. By default, a plus sign or asterisk in a regular expression causes Python to match the longest possible string that still results in a successful match; this is known as greedy matching. Since the tagged expression (.+) literally means one or more of any character (other than a newline), it also matches the quote that closes the first image name, and Python keeps matching until it reaches the last closing quote in the string.
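The greedy behavior can be reproduced in a self-contained script. The definition of the earlier pattern is not shown in this section, so the sketch below assumes it used a greedy (.+) group; the name greedy_imgpat is hypothetical and chosen only for this illustration:

```python
import re

# Assumed form of the earlier pattern: note the greedy (.+) group.
greedy_imgpat = re.compile(r'''< *img +src *= *["'](.+)["']''', re.IGNORECASE)

newtext = '<img src = "/one.jpg"> <br> <img src = "/two.jpg">'

# (.+) grabs as much as it can, then backs up only as far as the
# last quote in the string, so both tags end up in one match.
greedy_result = greedy_imgpat.findall(newtext)
print(greedy_result)       # one overlong string instead of two names
print(len(greedy_result))  # 1
```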

To prevent this behavior, follow the plus sign or asterisk with a question mark (?). This tells Python to look for the shortest possible match instead, overriding its default greedy behavior. With this modification, our regular expression returns the expected results:

>>> imgpat = re.compile(r'''< *img +src *= *["'](.+?)["']''', re.IGNORECASE)
>>> imgpat.findall(newtext)
['/one.jpg', '/two.jpg']
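The same contrast can be seen on simpler strings. The following is a minimal sketch, with sample strings and variable names made up for illustration, comparing the greedy and non-greedy forms of both the plus sign and the asterisk:

```python
import re

text = '<b>one</b> <b>two</b>'

# Greedy: (.+) runs to the last </b> in the string.
greedy_plus = re.findall(r'<b>(.+)</b>', text)
# Non-greedy: (.+?) stops at the first </b> it can.
lazy_plus = re.findall(r'<b>(.+?)</b>', text)
print(greedy_plus)  # ['one</b> <b>two']
print(lazy_plus)    # ['one', 'two']

quoted = '"" "a"'

# The question mark works the same way after an asterisk.
greedy_star = re.findall(r'"(.*)"', quoted)
lazy_star = re.findall(r'"(.*?)"', quoted)
print(greedy_star)  # ['" "a']
print(lazy_star)    # ['', 'a']
```

Note that the non-greedy asterisk can match an empty string, which is why the first result from lazy_star is ''.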

Phil Spector 2003-11-12