
Multiple Matches

We've already seen that the findall method returns a list containing every non-overlapping occurrence of a match within a string. findall has a few subtleties that should be mentioned, and there are alternatives which may be more useful in certain situations.

One consideration about findall is that if there are tagged subexpressions within the regular expression, findall returns only the tagged text instead of the entire match: a list of strings if there is a single tagged subexpression, or a list of tuples if there are several. For example, consider matching a pattern consisting of a number followed by a word. To capture the number and word as separate entities, we can surround their patterns with parentheses:

>>> tstpat = re.compile(r'(\d+) (\w+)')
>>> tstpat.findall('17 red 18 blue')
[('17', 'red'), ('18', 'blue')]

But what if we include parentheses in the regular expression for purposes of grouping only? Consider the problem of identifying numeric IP addresses in a text string. A numeric IP address consists of four numbers separated by periods. A regular expression to find these addresses could be composed as follows:

>>> ippat = re.compile(r'\d+(\.\d+){3}')

Note that since we are looking for a literal period, we need to escape it with a backslash, so that it is not interpreted as the special character that matches any single character. If we now use findall to extract multiple IP addresses from a text line, we may be surprised at the result:

>>> addrtext = 'Python web site: 132.151.1.90 \
... Google web site: 216.239.35.100'
>>> ippat.findall(addrtext)
['.90', '.100']

The problem is that Python interprets the parentheses as tagging operators, even though we only wanted them to be used for grouping. Because the tagged group is repeated three times, only the text from its last repetition is kept, and that is what findall returns. To solve this problem, you can use the special sequence of characters (?: to open the grouping parentheses. This informs Python that the parentheses are for grouping only, and it does not tag the parenthesized expression for later extraction.

>>> ippat = re.compile(r'\d+(?:\.\d+){3}')
>>> addrtext = 'Python web site: 132.151.1.90  \
... Google web site: 216.239.35.100'
>>> ippat.findall(addrtext)
['132.151.1.90', '216.239.35.100']
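
As a rough summary of the behavior described above, the following sketch runs findall on the same string with no tagged groups, one tagged group, two tagged groups, and grouping-only parentheses. The variable names are chosen here purely for illustration.

```python
import re

text = '17 red 18 blue'

# No tagged groups: each list item is the entire match.
no_groups = re.findall(r'\d+ \w+', text)

# One tagged group: each item is the text of that group only.
one_group = re.findall(r'(\d+) \w+', text)

# Two tagged groups: each item is a tuple holding both groups.
two_groups = re.findall(r'(\d+) (\w+)', text)

# (?: ... ) parentheses group without tagging, so findall
# goes back to returning the entire match.
grouping_only = re.findall(r'(?:\d+) (?:\w+)', text)

print(no_groups)
print(one_group)
print(two_groups)
print(grouping_only)
```

The first and last calls should produce the same list, since neither pattern contains a tagged subexpression.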

More control over multiple matches within a string can be achieved by using the match object returned by search or match. Among other information, this object provides two methods, start and end, which return the indices in the searched string where the match begins and ends; as with a slice, the index returned by end is one past the last matched character. Called with an argument, they return the starting and ending indices of the corresponding tagged group; called without an argument, they return the indices for the entire match. Thus, by slicing the original string to remove each match as it is found, multiple matches can be processed one at a time. Like so many other features of Python, the choice between findall and processing the match object is usually a personal one; you just have to decide which will be most useful in a given setting.
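
To make the start and end methods concrete, here is a small sketch that reuses the number-and-word pattern from earlier in this section. The sample string and the variable names whole, number and word are made up for this illustration.

```python
import re

tstpat = re.compile(r'(\d+) (\w+)')
text = 'paint: 17 red'
mtch = tstpat.search(text)

# Without an argument, start and end describe the entire match.
whole = (mtch.start(), mtch.end())

# With an argument, they describe the corresponding tagged group.
number = (mtch.start(1), mtch.end(1))
word = (mtch.start(2), mtch.end(2))

# The indices can be used directly as slice bounds on the searched string.
print(text[whole[0]:whole[1]])    # same text as mtch.group(0)
print(text[number[0]:number[1]])  # same text as mtch.group(1)
print(text[word[0]:word[1]])      # same text as mtch.group(2)
```

Since end returns the index just past the last matched character, the pair of values can be used as slice bounds without any adjustment.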

To process the IP addresses in the previous example one at a time, we could use code like the following:

>>> addrtext = 'Python web site: 132.151.1.90  \
... Google web site: 216.239.35.100'
>>> newtext = addrtext
>>> ippat = re.compile(r'\d+(?:\.\d+){3}')
>>> mtch = ippat.search(newtext)
>>> count = 1
>>> while mtch:
...     print('match %d: %s' % (count, mtch.group(0)))
...     count = count + 1
...     # drop everything up to the end of this match, then search again
...     newtext = newtext[mtch.end(0):]
...     mtch = ippat.search(newtext)
...
match 1: 132.151.1.90
match 2: 216.239.35.100
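
One possible variation on the loop above, offered only as a sketch: the search method of a compiled pattern accepts an optional starting position, so the string does not have to be sliced, and the indices reported by start and end then refer to the original string. The helper name iter_matches is hypothetical and is not part of the re module.

```python
import re

def iter_matches(pattern, text):
    """Yield successive non-overlapping match objects found in text.

    iter_matches is an illustrative helper, not a library function.
    """
    pos = 0
    while pos <= len(text):
        mtch = pattern.search(text, pos)
        if mtch is None:
            break
        yield mtch
        if mtch.end() > mtch.start():
            # Resume the search where this match ended.
            pos = mtch.end()
        else:
            # Empty match: step forward one character to avoid looping forever.
            pos = mtch.end() + 1

ippat = re.compile(r'\d+(?:\.\d+){3}')
addrtext = ('Python web site: 132.151.1.90  '
            'Google web site: 216.239.35.100')

for count, mtch in enumerate(iter_matches(ippat, addrtext), 1):
    # start and end are positions in addrtext itself, not in a slice of it.
    print('match %d: %s at %d-%d' % (count, mtch.group(0),
                                     mtch.start(), mtch.end()))
```

Whether this is preferable to slicing is largely a matter of taste; it mainly helps when the positions of the matches in the original string are needed later.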


Phil Spector 2003-11-12