
Constructing Regular Expressions

Regular expressions have three basic components: literal characters, each of which matches a single character; character classes, which are matched by any of several different characters; and modifiers, which operate on characters, character classes, or combinations of the two. The literal characters include all the numeric and alphabetic characters, some punctuation characters, and all the special escapes listed in the top half of Table 3.1 (except for \b, as explained below). Because many punctuation characters serve as modifiers within regular expressions (see the table of modifiers below), each of the following characters must be preceded by a backslash (\) if it is to be matched literally:
     . ^ $ + ? * ( ) [ ] { } | \
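As a minimal sketch of the escaping rule, the fragment below compares an unescaped period with an escaped one, and uses Perl's built-in quotemeta function to escape text held in a variable. The sample strings and variable names are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for illustration.
my $price = 'Total: $3.50 (approx.)';

# Unescaped, '.' matches any character, so this pattern also matches "3x50".
my $loose_hit = ('3x50' =~ /3.50/) ? 1 : 0;

# Escaped, '\$' and '\.' match only a literal dollar sign and period.
my $strict_hit  = ($price =~ /\$3\.50/) ? 1 : 0;
my $strict_miss = ('3x50' =~ /3\.50/)   ? 1 : 0;

# quotemeta() puts a backslash before every non-word character, which is
# convenient when the text to be matched literally arrives in a variable.
my $literal = '(approx.)';
my $quoted  = quotemeta $literal;
my $var_hit = ($price =~ /$quoted/) ? 1 : 0;

print "loose=$loose_hit strict=$strict_hit miss=$strict_miss var=$var_hit\n";
```

Escaping by hand is fine for patterns typed directly into a program; quotemeta is probably the safer choice when the text to match comes from user input.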

A character class is represented in a regular expression by a series of characters surrounded by square brackets ([]), and will be matched by any one of the characters within the brackets. When you specify one of the "special" punctuation characters mentioned above inside a character class, you don't need to use the backslash (except for square brackets, dashes (-; see below), and the backslash itself), but there is no harm in doing so. If the first character of a character class is the caret (^), then the character class will be matched by any character except those between the square brackets. Such a construction is known as a negated character class. Consequently, a caret that you want to match literally must either be preceded by a backslash or appear somewhere other than first in the class.
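The following short sketch shows an ordinary class, a negated class, and a punctuation character used inside a class without a backslash. The word list is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical words, used only to exercise the patterns.
my @words = ('gray', 'grey', 'groy', 'gr.y');

# [ae] is matched by either 'a' or 'e'.
my @either = grep { /^gr[ae]y$/ } @words;

# [^ae] is a negated class: any single character except 'a' or 'e'.
my @neither = grep { /^gr[^ae]y$/ } @words;

# Inside a class, '.' is taken literally; no backslash is needed.
my @dot = grep { /^gr[.]y$/ } @words;

print "either:  @either\n";
print "neither: @neither\n";
print "dot:     @dot\n";
```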

Several shortcuts are available when you're writing a character class. Ranges of letters or digits can be specified by placing a dash between the beginning and ending characters; for example, [a-f] is matched by any of the letters a through f. Because the dash has this special meaning, you must precede it with a backslash to include a literal dash in a character class. Furthermore, Perl provides special escape sequences, listed in the table of escape sequences for character classes below. Each of these sequences can be used by itself to represent a character class, or can be included inside a character class to extend the set of characters that the class will match.
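To illustrate these shortcuts, the sketch below counts how many characters of a string are matched by a range, by an escape sequence used on its own, and by an escape sequence combined with other characters inside a class. The helper count_matches and the sample string are hypothetical names introduced for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many times a pattern matches in a string.
sub count_matches {
    my ($text, $re) = @_;
    my $count = () = $text =~ /$re/g;
    return $count;
}

my $sample = 'Room B-12, floor 3';   # hypothetical input

my $lower  = count_matches($sample, qr/[a-z]/);    # a range of lowercase letters
my $digits = count_matches($sample, qr/\d/);       # \d used by itself
my $mixed  = count_matches($sample, qr/[\d\-,]/);  # \d plus a literal dash and comma

print "lowercase=$lower digits=$digits mixed=$mixed\n";
```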

These character class shortcuts are very useful for verifying input that your programs receive, whether from a command-line interface or through CGI scripts running on a web server. As a very simple example, suppose we are expecting a username that contains only letters, numbers, or the underscore symbol. We can easily print a message advising of an illegal entry with a program fragment like:

 print "Illegal username entered\n" if $username =~ /\W/;
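A slightly fuller sketch of the same check, wrapped in a hypothetical subroutine named is_illegal and run against a few made-up usernames, might look like this:

```perl
use strict;
use warnings;

# Hypothetical helper: true when the username contains any character
# other than a letter, a digit, or an underscore.
sub is_illegal {
    my ($username) = @_;
    return $username =~ /\W/ ? 1 : 0;
}

# Made-up usernames, for illustration only.
my @candidates = ('phil_s', 'user42', 'bad name', 'semi;colon');

for my $username (@candidates) {
    my $verdict = is_illegal($username) ? 'illegal' : 'ok';
    print "$username: $verdict\n";
}
```

Two caveats apply to this style of test. A newline is matched by \W, so input read from a filehandle should normally be passed through chomp first. An empty string contains no \W character at all, so it passes the check; a program that requires a non-empty username would need to test for that separately.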

Finally, a very useful escape sequence for constructing regular expressions is \b. Inside a character class, \b has its usual meaning as a backspace, but outside of a character class, \b matches only at a word boundary, that is, at a position between a word character (\w) and a non-word character or the start or end of the string. It matches a position rather than a character. This eliminates the need to individually check for all the different places that a word might be (surrounded by spaces, at the beginning or end of a sentence, followed by punctuation or newline, etc.). As with the other regular expression escape sequences, the uppercase form is the opposite: \B matches at any position that is not a word boundary.
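The difference can be seen by counting matches with and without the boundary escapes. The sentence below is a hypothetical example.

```perl
use strict;
use warnings;

my $text = 'The cat sat; concatenate "cat".';   # hypothetical sentence

# Without \b, "cat" is also found inside "concatenate".
my $anywhere = () = $text =~ /cat/g;

# With \b on both sides, only the standalone word matches, whatever
# whitespace or punctuation surrounds it.
my $whole = () = $text =~ /\bcat\b/g;

# \B requires a non-boundary, so this finds "cat" only inside a longer word.
my $inside = () = $text =~ /\Bcat\B/g;

print "anywhere=$anywhere whole=$whole inside=$inside\n";
```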


Table: Modifiers for Regular Expressions
Modifier   Meaning
^          anchors expression to beginning of target
$          anchors expression to end of target
.          matches any single character except newline
|          separates alternative patterns
()         groups patterns together
*          matches 0 or more occurrences of preceding entity
?          matches 0 or 1 occurrences of preceding entity
+          matches 1 or more occurrences of preceding entity
{n}        matches exactly n occurrences of preceding entity
{n,}       matches at least n occurrences of preceding entity
{n,m}      matches between n and m occurrences of preceding entity
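As a rough check of the modifiers in the table, the sketch below pairs each pattern with a test string and the result it should give. The patterns and strings are hypothetical and picked only to exercise one modifier each; every pattern is anchored with ^ and $ so that the whole string must match.

```perl
use strict;
use warnings;

# Each entry: pattern, test string, expected result (1 = match, 0 = no match).
my @cases = (
    [ qr/^ab*c$/,        'ac',    1 ],  # *     : zero or more 'b'
    [ qr/^ab+c$/,        'ac',    0 ],  # +     : at least one 'b' required
    [ qr/^colou?r$/,     'color', 1 ],  # ?     : the 'u' is optional
    [ qr/^\d{3}$/,       '1234',  0 ],  # {n}   : exactly three digits
    [ qr/^\d{2,}$/,      '1234',  1 ],  # {n,}  : two or more digits
    [ qr/^\d{2,3}$/,     '1234',  0 ],  # {n,m} : two to three digits
    [ qr/^(cat|dog)s?$/, 'dogs',  1 ],  # | ()  : grouped alternatives
    [ qr/^a.c$/,         'abc',   1 ],  # .     : any single character
);

my $failures = 0;
for my $case (@cases) {
    my ($re, $string, $expected) = @$case;
    my $got = ($string =~ $re) ? 1 : 0;
    $failures++ if $got != $expected;
    printf "%-24s %-6s %s\n", $re, $string, $got ? 'match' : 'no match';
}
print "unexpected results: $failures\n";
```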



Table: Escape sequences for character classes
Symbol   Matches                Symbol   Matches
\w       Alphanumerics and _    \W       Anything except alphanumerics and _
\d       Digits                 \D       Non-digits
\s       Whitespace             \S       Non-whitespace



Phil Spector 2002-10-18