
Constructing Regular Expressions

Regular expressions in Python are strings containing three different types of characters: literal characters, which represent a single character; character classes, which represent one of several different characters; and modifiers, which operate on characters or character classes. Literal characters include digits, upper- and lowercase letters, and the special characters listed in Table 2.1. Because many of the usual punctuation characters have a special meaning in regular expressions, you must precede any of them with a backslash (\) when you want it to be matched literally. These characters are:
          . ^ $ + ? * ( ) [ ] { } | \
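As a minimal sketch of the escaping rule above, the following uses re.search and re.escape from the standard re module. The sample text is hypothetical, and the raw string prefix (r'...') is a convenience assumed here so that Python itself does not interpret the backslashes before the regular expression sees them.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Total cost: $3.50 (approx.)"

# Unescaped, '.' matches any character, so '3.50' also matches '3x50'.
too_loose = re.search(r'3.50', '3x50')
print(too_loose is not None)                # True

# Escaping the dollar sign and the period makes both of them literal.
price = re.search(r'\$3\.50', text)
print(price.group())                        # $3.50

# re.escape() inserts the backslashes for an arbitrary literal string.
escaped = re.escape('(approx.)')
print(re.search(escaped, text).group())     # (approx.)
```

re.escape is mainly useful when the literal text comes from somewhere else (user input, a file name) and you cannot know in advance which special characters it contains.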

A character class is represented by one or more characters surrounded by square brackets ([]). A character class in a regular expression is matched by a single occurrence of any one of the characters within the brackets. Ranges of characters (like a-z or 5-9) are allowed in character classes. (If you need to specify a dash inside a character class, make sure that it is the first or the last character in the class, so that Python doesn't confuse it with a range of characters.) If the first character in a character class is a caret (^), then the character class is matched by any character except those listed within the square brackets. As a useful shortcut, Python provides some escape sequences which represent common character classes inside of regular expressions. These sequences are summarized in Table 8.1.


Table 8.1: Escape sequences for character classes
Symbol  Matches                 Symbol  Matches
\w      Alphanumerics and _     \W      Anything except alphanumerics and _
\d      Digits                  \D      Non-digits
\s      Whitespace              \S      Non-whitespace
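The character classes and escape sequences described above can be tried out with re.findall, which returns every non-overlapping match as a list. This is only an illustrative sketch; the sample string and variable names are made up for the example.

```python
import re

text = "ab-12 c_3"   # hypothetical sample string

# A range inside square brackets: any one of a, b or c.
in_range = re.findall(r'[a-c]', text)
print(in_range)      # ['a', 'b', 'c']

# A dash placed first in the class is a literal dash, not a range.
with_dash = re.findall(r'[-0-9]', text)
print(with_dash)     # ['-', '1', '2', '3']

# A leading caret negates the class: anything except a, b or c.
negated = re.findall(r'[^a-c]', text)
print(negated)       # ['-', '1', '2', ' ', '_', '3']

# Escape sequences from Table 8.1.
digits = re.findall(r'\d', text)
word_chars = re.findall(r'\w', text)
non_word = re.findall(r'\W', text)
print(digits)        # ['1', '2', '3']
print(word_chars)    # ['a', 'b', '1', '2', 'c', '_', '3']
print(non_word)      # ['-', ' ']
```

Note that in Python 3, when the pattern is an ordinary (Unicode) string, \w, \d and \s also match non-ASCII letters, digits and whitespace unless the re.ASCII flag is given.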


As mentioned previously, certain punctuation symbols have special meanings inside of regular expressions. The caret (^) indicates the beginning of a string, while the dollar sign ($) indicates the end of a string. Furthermore, within a regular expression, parentheses can be used to group together several characters or character classes. Finally, a number of characters known as modifiers, listed in Table 8.2, can be used within regular expressions. Modifiers can follow a character, a character class, or a parenthesized group of characters and/or character classes, and expand the range of what will be matched by the entity which precedes them. For example, the regular expression 'cat' would only be matched by a string containing those specific letters in the order given, while the regular expression 'ca*t' would be matched by strings containing sequences such as ct, cat, caat, caaat, and so on.
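The behavior of the anchors, of grouping, and of the 'ca*t' example can be checked with a short sketch like the one below. The helper function matches is hypothetical (it is not part of the re module) and simply reports whether re.search finds the pattern anywhere in the string.

```python
import re

def matches(pattern, s):
    """Hypothetical helper: True if pattern is found anywhere in s."""
    return re.search(pattern, s) is not None

# 'ca*t': a 'c', then zero or more 'a's, then a 't'.
print([s for s in ['ct', 'cat', 'caaat', 'cart'] if matches(r'ca*t', s)])
# ['ct', 'cat', 'caaat']

# The caret anchors the match at the beginning of the string,
# the dollar sign at the end.
print(matches(r'^cat', 'catalog'), matches(r'^cat', 'concat'))   # True False
print(matches(r'cat$', 'concat'), matches(r'cat$', 'catalog'))   # True False

# Without parentheses, '*' applies only to the 'b';
# with them, it applies to the whole group 'ab'.
print(matches(r'^ab*$', 'abbb'), matches(r'^(ab)*$', 'abbb'))    # True False
print(matches(r'^(ab)*$', 'ababab'))                             # True
```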

Table 8.2: Modifiers for Regular Expressions
Modifier  Meaning
.         matches any single character except newline
|         separates alternative patterns
*         matches 0 or more occurrences of preceding entity
?         matches 0 or 1 occurrences of preceding entity
+         matches 1 or more occurrences of preceding entity
{n}       matches exactly n occurrences of preceding entity
{n,}      matches at least n occurrences of preceding entity
{n,m}     matches between n and m occurrences of preceding entity
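Each modifier in Table 8.2 can be exercised with a small table of test cases, as in the sketch below. It assumes Python 3.4 or later, where re.fullmatch is available; re.fullmatch succeeds only if the entire string matches the pattern, which makes the effect of each modifier easy to see. The patterns and strings are illustrative choices, not taken from the original text.

```python
import re

# (pattern, string, whether the whole string is expected to match)
cases = [
    (r'c.t',      'cut',     True),    # . matches any single character
    (r'c.t',      'c\nt',    False),   # ... except a newline
    (r'cat|dog',  'dog',     True),    # | separates alternatives
    (r'cat|dog',  'cow',     False),
    (r'colou?r',  'color',   True),    # ? allows 0 or 1 'u'
    (r'colou?r',  'colouur', False),
    (r'ca+t',     'caat',    True),    # + requires at least one 'a'
    (r'ca+t',     'ct',      False),
    (r'\d{3}',    '123',     True),    # exactly 3 digits
    (r'\d{3}',    '12',      False),
    (r'\d{2,}',   '12345',   True),    # at least 2 digits
    (r'\d{2,}',   '1',       False),
    (r'\d{2,4}',  '123',     True),    # between 2 and 4 digits
    (r'\d{2,4}',  '12345',   False),
]

results = []
for pattern, s, expected in cases:
    matched = re.fullmatch(pattern, s) is not None
    results.append(matched)
    print(f'{pattern!r:12} {s!r:10} {matched}')
```

With re.search instead of re.fullmatch, a case such as r'\d{2,4}' against '12345' would succeed, because the pattern is allowed to match just part of the string.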



Phil Spector 2003-11-12