<img src="/images/back.gif">When constructing a regular expression in situations like this, it's important to consider the variations which may exist in practical applications. For example, the HTML standard allows blanks around its keywords, as well as upper or lower case,and filenames surrounded by single or double quotes. Thus, to compile a regular expression which would match constructions like the one above we could use the following statement:
>>> imgpat = re.compile(r'< *img +src *= *["\'].+["\']',re.IGNORECASE)Note the use of backslashes before the single quotes in the regular expression. Since single quotes were used to delimit the regular expression, they must be escaped inside the expression itself. Alternatively, triple quotes could be used:
>>> imgpat = re.compile(r'''< *img +src *= *["'].+["']''',re.IGNORECASE)When we use this regular expression, it will find the required pattern, but there is no simple provision for extracting just the image name. To make it easy to access a portion of a matched regular expression, we can surround a portion of the expression with parentheses, and then use the groups or group method of the returned matched object to access the piece we need. Alternatively, the findall method will return all the tagged pieces of a regular expression.
To extract just the image name from text using the above expression, we first must include parentheses around the portion of the regular expression corresponding to the desired image name, then use the search function to return an appropriate match object, and finally invoke the group method on the match object, passing it the argument 1 to indicate that we want the first tagged expression.
>>> imgtext = '<IMG SRC= "../images/picture.jpg"><br>Here is a picture' >>> imgpat = re.compile(r'''< *img +src *= *["'](.+)["']''',re.IGNORECASE) >>> m = imgpat.search(imgtext) >>> m.group(1) '../images/picture.jpg'If the group method is passed the value 0, it will return the entire text which matched the regular expression; if it's passed a list of numbers, it will return a tuple containing the corresponding tagged expressions. The groups method for match objects returns all of the tagged expressions in a tuple.
The image name could also be extracted using findall:
>>> imgpat.findall(imgtext) ['../images/picture.jpg']Note that findall returns a list, even when there is only one element.