
Using Named Groups for Tagging

When you only have one or two tagged groups in a regular expression, it isn't too difficult to refer to them by number. But when you have many tagged expressions, or you're aiming to maximize the readability of your programs, it's handy to be able to refer to the groups by name. To create a named group in a Python regular expression, instead of using plain parentheses to surround the group, use parentheses of the form (?P<name>...), where name is the name you wish to associate with the tagged expression, and ... represents the tagged expression itself. For example, suppose we have employee records containing a name, an office number and a phone extension, which look like these:
Smith 209 x3121
Jones 143 x1134
Williams 225 555-1234
Normally, to tag each element on the line, we'd use regular parentheses:
    import re
    recpat = re.compile(r'(\w+) (\d+) (x?[0-9-]+)')
To refer to the three tagged patterns as name, room and phone, we could use the following expression:
    recpat1 = re.compile(r'(?P<name>\w+) (?P<room>\d+) (?P<phone>x?[0-9-]+)')
First, note that using named groups does not override the default behaviour of tagging - the findall function and method will still work in the same way, and you can always refer to the tagged groups by number. However, when you use the group method on a match object returned by search or match, you can use the name of the group instead of the number (although the number will still work):
>>> record = 'Jones 143 x1134'   
>>> m = recpat1.search(record)                 
>>> m.group('name')      
'Jones'
>>> m.group('room')
'143'            
>>> m.group('phone')
'x1134'
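
When every field of a record is wanted at once, the match object's groupdict method and the compiled pattern's groupindex attribute can be convenient. The following is a minimal sketch that reuses recpat1 and the three sample records from above; the variable names records, parsed and name_to_number are only illustrative.

```python
import re

# Same named-group pattern as in the text above.
recpat1 = re.compile(r'(?P<name>\w+) (?P<room>\d+) (?P<phone>x?[0-9-]+)')

# The three sample employee records from the text.
records = ['Smith 209 x3121', 'Jones 143 x1134', 'Williams 225 555-1234']

# groupdict() returns a dictionary mapping each group name to its matched text.
parsed = [recpat1.search(rec).groupdict() for rec in records]

for fields in parsed:
    print(fields['name'], fields['room'], fields['phone'])

# groupindex maps each group name to its group number, which shows that
# names and numbers refer to the same tagged groups.
name_to_number = dict(recpat1.groupindex)
```

Here each element of parsed is an ordinary dictionary, so the fields can be passed around without keeping the match object.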

Now suppose we wish to refer to the tagged groups as part of a substitution pattern. Specifically, we wish to change each record to one with just the room number followed by the name. Using the pattern without named groups, we could do the following:

>>> recpat.sub(r'\2 \1',record)
'143 Jones'
With named groups, we can use the syntax \g<name> to refer to the tagged group in substitution text:
>>> recpat1.sub(r'\g<room> \g<name>',record)
'143 Jones'
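
The replacement passed to sub may also be a function, which receives the match object and can use the group names directly. The sketch below applies both styles to all three sample records at once; the helper room_first and the choice to upper-case the name are assumptions made for illustration only.

```python
import re

recpat1 = re.compile(r'(?P<name>\w+) (?P<room>\d+) (?P<phone>x?[0-9-]+)')

# The three sample records joined into one multi-line string.
records = 'Smith 209 x3121\nJones 143 x1134\nWilliams 225 555-1234'

def room_first(m):
    # m is the match object for one record; build "room NAME" from its named groups.
    return m.group('room') + ' ' + m.group('name').upper()

# Template replacement: \g<name> refers to a named group (raw string keeps the backslash).
swapped = recpat1.sub(r'\g<room> \g<name>', records)

# Function replacement: called once per match, returns the replacement text.
shouted = recpat1.sub(room_first, records)

print(swapped)
print(shouted)
```

A function replacement is worth the extra lines mainly when the new text cannot be written as a fixed template, for example when a field needs to be converted or looked up.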

To refer to a tagged group within a regular expression, the notation (?P=name) can be used. Suppose we're trying to detect duplicate words appearing next to each other on the same line. Without named groups, we could do the following:

>>> line = 'we need to to find the repeated words'
>>> re.findall(r'(\w+) \1',line)
['to']
Using named groups we can make the regular expression a little more readable:
>>> re.findall(r'(?P<word>\w+) (?P=word)',line)
['to']
Notice that when the (?P=name) form is used, the parentheses do not create a new tagged group; this is why findall still returns only the single tagged word for each match.
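
One caveat with the duplicate-word pattern is that it has no word boundaries, so the end of one word can pair with the start of the next. A possible refinement, not part of the example above, is to add \b on each side; the names dup, found, loose and strict are illustrative.

```python
import re

# \b anchors the match at word boundaries; (?P=word) must repeat the text
# captured by the group named "word".
dup = re.compile(r'\b(?P<word>\w+) (?P=word)\b')

line = 'we need to to find the repeated words'

# finditer yields match objects, so the group can be fetched by name.
found = [m.group('word') for m in dup.finditer(line)]

# Without boundaries, the final "e" of "the" pairs with the first "e" of "east".
loose = re.findall(r'(?P<word>\w+) (?P=word)', 'the east')

# With boundaries, only whole repeated words are reported.
strict = dup.findall('the east')

print(found, loose, strict)
```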


Phil Spector 2003-11-12