Next: Using Named Groups for Up: The re module: Regular Previous: Finding Regular Expression Matches   Contents

Tagging in Regular Expressions

In the previous section, we were interested in the entire text matched by the regular expression (an email address), so extracting the whole match from a string was sufficient for our purposes. However, in many cases the text we want is identified by its surrounding context, and we need to extract only a subsection of the match. Consider the problem of extracting the names of images referenced in a web page. An example of such a reference is
    <img src="/images/back.gif">
When constructing a regular expression in situations like this, it's important to consider the variations which may exist in practical applications. For example, HTML allows tag and attribute names in upper or lower case, blanks around the equals sign of an attribute, and attribute values surrounded by single or double quotes. Thus, to compile a regular expression which would match constructions like the one above (and which also tolerates blanks after the opening angle bracket), we could use the following statement:
>>> imgpat = re.compile(r'< *img +src *= *["\'].+["\']',re.IGNORECASE)
Note the use of backslashes before the single quotes in the regular expression. Since single quotes were used to delimit the regular expression, they must be escaped inside the expression itself. Alternatively, triple quotes could be used:
>>> imgpat = re.compile(r'''< *img +src *= *["'].+["']''',re.IGNORECASE)
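As a quick check, the two spellings of the pattern can be compared on a few tags. This is only a sketch: the sample strings below are made up for illustration and are not taken from a real page.

```python
import re

# The two spellings of the pattern shown above.
escaped = re.compile(r'< *img +src *= *["\'].+["\']', re.IGNORECASE)
triple = re.compile(r'''< *img +src *= *["'].+["']''', re.IGNORECASE)

# Hypothetical sample tags covering case, spacing, and quote variations.
samples = [
    '<img src="/images/back.gif">',
    "<IMG SRC='/images/back.gif'>",
    '<Img   src = "/images/back.gif">',
    '<p>no image here</p>',
]

# For each sample, record whether each spelling finds a match.
results = [(bool(escaped.search(s)), bool(triple.search(s))) for s in samples]

for s, (e, t) in zip(samples, results):
    print(s, e, t)
```

Both versions should agree on every sample, since the backslash in the first spelling only protects the quote from ending the Python string.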
When we use this regular expression, it will find the required pattern, but there is no simple provision for extracting just the image name. To make it easy to access a portion of the matched text, we can surround the corresponding portion of the expression with parentheses, and then use the groups or group method of the returned match object to access the piece we need. Alternatively, the findall method will return the tagged pieces for every match of the regular expression in a string.

To extract just the image name from text using the above expression, we first must include parentheses around the portion of the regular expression corresponding to the desired image name, then use the search function to return an appropriate match object, and finally invoke the group method on the match object, passing it the argument 1 to indicate that we want the first tagged expression.

>>> imgtext = '<IMG  SRC= "../images/picture.jpg"><br>Here is a picture'
>>> imgpat = re.compile(r'''< *img +src *= *["'](.+)["']''',re.IGNORECASE)
>>> m = imgpat.search(imgtext)
>>> m.group(1)
'../images/picture.jpg'
If the group method is passed the value 0, it will return the entire text which matched the regular expression; if it's passed several numbers as separate arguments, it will return a tuple containing the corresponding tagged expressions. The groups method for match objects returns all of the tagged expressions in a tuple.
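The different ways of calling group can be sketched with a pattern that has two tagged expressions. The pattern imgpat2 below is a hypothetical variant introduced only for this illustration; it tags the opening quote character in addition to the image name.

```python
import re

# Hypothetical variant of the pattern with two tagged pieces:
# group 1 is the opening quote character, group 2 is the image name.
imgpat2 = re.compile(r'''< *img +src *= *(["'])(.+)["']''', re.IGNORECASE)
imgtext = '<IMG  SRC= "../images/picture.jpg"><br>Here is a picture'

m = imgpat2.search(imgtext)

whole = m.group(0)       # the entire text matched by the pattern
pair = m.group(1, 2)     # several group numbers give a tuple
everything = m.groups()  # all tagged expressions, also a tuple

print(whole)
print(pair)
print(everything)
```

Here m.group(1, 2) and m.groups() give the same tuple, because the pattern has exactly two tagged expressions.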

The image name could also be extracted using findall:

>>> imgpat.findall(imgtext)
['../images/picture.jpg']
Note that findall returns a list, even when there is only one element.
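When a line contains more than one image reference, the .+ in the pattern above is greedy and will run on to the last quote on the line. The sketch below contrasts it with one possible alternative, a tagged expression that excludes quote characters. Both the sample text and the name narrow are assumptions made for this illustration, not part of the original example.

```python
import re

# Hypothetical text with two image references on a single line.
page = '<img src="a.gif"> and <IMG SRC=\'b.png\'> on one line'

# The pattern from the text: .+ is greedy and runs to the last quote.
greedy = re.compile(r'''< *img +src *= *["'](.+)["']''', re.IGNORECASE)

# An alternative sketch: the tagged piece may not contain a quote,
# so each match stops at the first closing quote.
narrow = re.compile(r'''< *img +src *= *["']([^"']+)["']''', re.IGNORECASE)

greedy_names = greedy.findall(page)
narrow_names = narrow.findall(page)

print(greedy_names)
print(narrow_names)
```

With this sample text, the greedy pattern returns a single element spanning both references, while the narrower pattern returns one element per image.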


Phil Spector 2003-11-12