>>> newtext = '<img src = "/one.jpg"> <br> <img src = "/two.jpg">'
>>> imgpat.findall(newtext)
['/one.jpg"> <br> <img src = "/two.jpg']

Instead of the expected result, we got the first image name followed by additional text, all the way through the end of the second image name. The problem lies in the behavior of the regular expression modifier plus sign (+).
By default, a plus sign or asterisk in a regular expression causes Python to match the longest possible string that still results in a successful match. Since the tagged expression (.+) literally means one or more of any character (other than a newline), Python continues to match the text until the final closing double quote is found.
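The greedy behavior just described can be seen in isolation with a much smaller pattern. The following is only a minimal sketch: the sample string and the variable names `text` and `greedy` are made up for illustration and are not part of the image example.

```python
import re

# Hypothetical sample text containing two separately quoted words.
text = 'say "one" and then "two"'

# Greedy: .+ first consumes as much as it can, then gives characters
# back only until the closing quote in the pattern can match. That
# happens at the LAST double quote in the string, not the first.
greedy = re.search(r'"(.+)"', text)

print(greedy.group(1))   # one" and then "two
```

The tagged group runs from just after the first quote to just before the last one, which is the same effect seen with the two image tags.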
To prevent this behavior, you can follow a plus sign or asterisk in a regular expression with a question mark (?) to inform Python that you want it to look for the shortest possible match, overriding its default, greedy behavior. With this modification, our regular expression returns the expected results:
>>> imgpat = re.compile(r'''< *img +src *= *["'](.+?)["']''', re.IGNORECASE)
>>> imgpat.findall(newtext)
['/one.jpg', '/two.jpg']
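
For a side-by-side comparison, the two forms can be run as a short script. This is a sketch under one assumption: the greedy pattern is taken to be the same expression without the question mark, since the original definition is not repeated in this section. The names `greedy_pat`, `lazy_pat`, and `empty_ok` are illustrative only.

```python
import re

newtext = '<img src = "/one.jpg"> <br> <img src = "/two.jpg">'

# Assumed form of the earlier pattern: identical to the corrected one,
# except that the tagged group uses the greedy .+ instead of .+?
greedy_pat = re.compile(r'''< *img +src *= *["'](.+)["']''', re.IGNORECASE)

# Corrected pattern from the text: .+? stops at the first closing quote.
lazy_pat = re.compile(r'''< *img +src *= *["'](.+?)["']''', re.IGNORECASE)

greedy_result = greedy_pat.findall(newtext)
lazy_result = lazy_pat.findall(newtext)

print(greedy_result)   # ['/one.jpg"> <br> <img src = "/two.jpg']
print(lazy_result)     # ['/one.jpg', '/two.jpg']

# The question mark works the same way after an asterisk: .*? matches
# as few characters as possible, including none at all.
empty_ok = re.findall(r'"(.*?)"', 'a "" b "x"')
print(empty_ok)        # ['', 'x']
```

The only difference between the two compiled patterns is the single question mark, which makes it easy to check that the change in results comes from greediness alone.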