
Tagging of Regular Expressions

In many situations, it's necessary to retrieve just the portion of the text that matches part of a regular expression. For example, we may want to extract all of a character string up to the first blank, or a number that is preceded by a specified character string. In these cases, unescaped parentheses can be placed around the part of the pattern we wish to retrieve; each such set of parentheses produces a special variable whose name is a dollar sign followed by a number. The text matched by the first parenthesized expression is stored in $1, the next in $2, and so on, counting opening parentheses from left to right. One word of caution: these special variables are overwritten by the next successful match, so if you plan to use them later in your program, it is prudent to copy them into a regular variable immediately. If you are using the debugger to test your regular expression, you'll need to store any tagged values in a regular variable on the same line of input as the regular expression, separating the assignment from the match with a semicolon.
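
The two cases mentioned above can be sketched as follows. This is only an illustration: the sample strings and the variable names ($line, $record, $first_word, $width, $height) are made up for the example and do not come from any particular program.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for illustration.
my $line = "alpha beta gamma";
my $first_word;
if ($line =~ /^([^ ]+) /) {     # everything up to the first blank
    $first_word = $1;           # copy at once, before another match runs
}

my $record = "width=42 height=7";
my ($width, $height);
if ($record =~ /width=(\d+) +height=(\d+)/) {
    ($width, $height) = ($1, $2);   # first group is $1, second is $2
}

print "$first_word $width $height\n";
```

Copying $1 and $2 inside the if block means the values are taken only from a match that actually succeeded.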

As a simple example, suppose we wish to extract the names of all the images referred to in a web page. These images will be found in strings like:

     <img src="/images/back.gif">

Since HTML tag and attribute names are case-insensitive, and extra whitespace may appear between the parts of a tag, we must construct our regular expression so that it will still work, even if the HTML looks like, say:

     <IMG   SRC = "/images/back.gif">

     < IMG  SRC ='/images/back.gif'>

In words, what we are looking for is the left angle bracket, followed by zero or more blanks, followed by the string "img", followed by one or more blanks, followed by the string "src", followed by zero or more blanks, followed by an equal sign, followed by zero or more blanks, followed by either a single or a double quote, the string we want, and finally a closing single or double quote. Each individual part is easily represented in a regular expression, and we simply need to put them together. We'll use the i modifier to take care of the issue of case, and we'll tag the actual image name and store it in the variable $imagename for later use.

     if (/< *img +src *= *["']([^"']+)["']/i) {
         $imagename = $1;    # copy the tagged text only after a successful match
     }

In this example, we're assuming that the text to be searched for the image tags is in the default variable $_. The tagged expression will contain all the characters between the opening quote and the closing quote, but not the quotes themselves. If the match is not successful, $1 is not reset: it keeps whatever value the last successful match in the enclosing scope left in it (undef if there was none), so test the result of the match before relying on $1 or $imagename. When applied to any of the sample HTML fragments above, the variable $imagename will contain the value "/images/back.gif".
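
As a quick check, the pattern can be run against the three sample fragments shown earlier. This is a minimal sketch; the array names @fragments and @names are made up for the illustration.

```perl
use strict;
use warnings;

# The three sample fragments from the text above.
my @fragments = (
    '<img src="/images/back.gif">',
    '<IMG   SRC = "/images/back.gif">',
    q{< IMG  SRC ='/images/back.gif'>},
);

my @names;
for (@fragments) {                       # each fragment is placed in $_
    if (/< *img +src *= *["']([^"']+)["']/i) {
        push @names, $1;                 # save the tagged image name
    }
}
print "$_\n" for @names;
```

Each of the three fragments should yield the same image name, /images/back.gif.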

You can also refer to tagged expressions within the regular expression itself, but in this case you should refer to them using a backslash (\) instead of a dollar sign, i.e. \1, \2, etc. Suppose we are writing a program to find occurrences of two identical words in a row in some text. It's easy to recognize a word in perl; it's simply a word boundary followed by one or more word characters (letters, digits, or underscores), followed by a word boundary: \b\w+\b. By tagging the word itself, and then referring to it as \1 later in the same pattern, we can find the double occurrences:

    # \s+ matches the whitespace separating the two copies of the word
    if (/\b(\w+)\s+\1\b/) {
       print "Duplicate word ($1) found in line $.\n";
    }
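
To see a backreference at work on concrete input, here is a small sketch that assumes the two copies of the word are separated by whitespace (\s+). The sample lines and the names @lines and @found are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical sample lines, chosen only for illustration.
my @lines = (
    "Paris in the the spring",
    "nothing repeated here",
    "this is fine",            # "is" inside "this" must not count as a word
);

my @found;
for my $text (@lines) {
    # \1 must match exactly the text captured by the first group.
    if ($text =~ /\b(\w+)\s+\1\b/) {
        push @found, $1;
    }
}
print "Duplicates: @found\n";
```

The leading \b matters here: without it, the pattern would match "is is" starting in the middle of "this is" on the third line.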

Sometimes it's useful to group parts of a regular expression together, even though we don't wish to incur the additional overhead of tagging them for later retrieval. In cases like this, you can use the three-character sequence (?: as the opening parenthesis, and the usual ) as the closing parenthesis. Such groupings do not affect the special variables $1, $2, etc.
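
A minimal sketch of a non-tagging group follows. The URL and the variable names $url and $host are hypothetical values used only for the illustration.

```perl
use strict;
use warnings;

my $url = "https://example.com/index.html";   # hypothetical sample value
my $host;

# (?:https?|ftp) groups the alternatives without creating a tagged
# expression, so the host name is still the first one, $1.
if ($url =~ m{^(?:https?|ftp)://([^/]+)}) {
    $host = $1;
}
print "$host\n";
```

Had ordinary parentheses been used around the alternatives, the scheme would have landed in $1 and the host name in $2.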


Phil Spector 2002-10-18