$1, the next in $2, and so on.
One word of caution: these special variables are overwritten by the next
successful regular expression match, so if you plan to use them in your program, it
is prudent to copy them into a regular variable immediately. If you are using the
debugger to test your regular expression, you'll need to store any tagged values in
a regular variable on the same line of input as your regular expression, separating
the assignment from the regular expression operator with a semicolon.
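The following is a minimal sketch of that habit, not a definitive recipe. The helper name first_word_then_number and the sample text are hypothetical, chosen only for illustration; the point is that each tagged value is copied before the next match can replace it.

```perl
use strict;
use warnings;

# Hypothetical helper: pulls the first word and the first number out of
# a string, copying $1 into a regular variable after each match.
sub first_word_then_number {
    my ($text) = @_;

    my $word;
    if ($text =~ /(\w+)/) {
        $word = $1;      # copy right away, before any other match runs
    }

    my $number;
    if ($text =~ /(\d+)/) {
        $number = $1;    # this successful match has replaced $1
    }

    return ($word, $number);
}

my ($word, $number) = first_word_then_number("apples 42");
print "word=$word number=$number\n";
```

Had the first value been left in $1, it would have been lost as soon as the second pattern matched.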
As a simple example, suppose we wish to extract the names of all the images referred to in a web page. These images will be found in strings like:
<img src="/images/back.gif">
Since HTML is case-insensitive, and allows for unlimited whitespace between elements, we must construct our regular expression so that it will still work, even if the HTML looks like, say:
<IMG SRC = "/images/back.gif">
or
< IMG SRC ='/images/back.gif'>
In words, what we are looking for is the left angle bracket, followed by zero or more blanks, followed by the string ``img'', followed by one or more blanks, followed by the string ``src'', followed by zero or more blanks, followed by an equal sign, followed by zero or more blanks, followed by either a single or double quote, the string we want, and finally a closing single or double quote. Each individual part is easily represented in a regular expression, and we simply need to put them together. We'll use the i modifier to take care of the issue of case, and we'll tag the actual image name, and store it in the variable $imagename for later use.
/< *img +src *= *["']([^"']+)["']/i; $imagename = $1;
In this example, we're assuming that the text to be searched for the image tags is in the default variable $_. The tagged expression will contain all the characters between the opening quote and the closing quote, but not the quotes themselves. If the match is not successful, $1 is not reset: it keeps the value left by the last successful match, or undef if there has been none, so check that the match succeeded before relying on $imagename. When applied to any of the sample HTML fragments above, the variable $imagename will contain the value ``/images/back.gif''.
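As a rough sketch, the same pattern can be wrapped in a small subroutine and tried against the three sample tags. The name image_name is hypothetical, and the sketch matches against an explicit argument rather than $_; it copies $1 only when the match succeeds.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the image name found in one piece of
# HTML, or undef when no img tag of the expected form is present.
sub image_name {
    my ($html) = @_;
    if ($html =~ /< *img +src *= *["']([^"']+)["']/i) {
        return $1;       # only trust $1 after a successful match
    }
    return;              # undef in scalar context
}

for my $tag ('<img src="/images/back.gif">',
             '<IMG SRC = "/images/back.gif">',
             q{< IMG SRC ='/images/back.gif'>}) {
    my $name = image_name($tag);
    print defined $name ? "$name\n" : "no image found\n";
}
```

Each of the three sample tags yields the same image name, and a string without an img tag yields undef rather than a stale value.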
You can also use tagged expressions within a regular expression,
but in this case you should refer to them using a backslash (\) instead
of a dollar sign, i.e. \1, \2, etc. Suppose we are writing
a program to find occurrences of two identical words in a row in some text. It's easy
to recognize a word in perl; it's simply a word boundary followed by one or more
alphanumeric characters, followed by a word boundary: \b\w+\b. By tagging
the word itself, and then referring to it as \1 later in the same regular
expression, after the whitespace that separates the two words, we can find the
double occurrences:
if(/\b(\w+)\s+\1\b/){ print "Duplicate word ($1) found in line $.\n"; }
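A small sketch of the duplicate-word check follows. The subroutine name doubled_word and the sample lines are hypothetical, and the pattern used here allows one or more whitespace characters (\s+) between the two copies of the word.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first word that appears twice in a
# row in $line, or undef if there is no such word.
sub doubled_word {
    my ($line) = @_;
    # (\w+) tags a word, \s+ skips the gap, and \1 must repeat the
    # tagged word exactly; the \b anchors keep "this is" from matching.
    if ($line =~ /\b(\w+)\s+\1\b/) {
        return $1;
    }
    return;
}

for my $line ("this is is a test", "nothing repeated here") {
    my $dup = doubled_word($line);
    print defined $dup ? "Duplicate word ($dup) found\n"
                       : "No duplicate found\n";
}
```

The leading \b matters: without it, the ``is'' at the end of ``this'' followed by a separate ``is'' would be reported as a duplicate.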
Sometimes it's useful to group parts of regular expressions together, even though
we don't wish to incur the additional overhead of tagging the expressions for
later retrieval. In cases like this, you can use the three-character sequence (?:
as the opening parenthesis, and the usual ) as the closing parenthesis. Such
groupings do not affect the special variables $1, $2, etc.
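To illustrate the difference, here is a brief sketch comparing a tagged group with a (?: group. The subroutine names and the file name ``back.gif'' are hypothetical; both patterns accept the same strings, but only the ordinary parentheses produce a second tagged value.

```perl
use strict;
use warnings;

# Hypothetical helpers: each returns the list of tagged values produced
# by matching a file name (a match in list context returns that list).

# Ordinary parentheses around the extension: two tagged values.
sub tagged_parts {
    my ($file) = @_;
    my @parts = $file =~ /^(\w+)\.(gif|jpe?g)$/i;
    return @parts;
}

# (?: ... ) groups the alternatives without tagging them: one value.
sub grouped_parts {
    my ($file) = @_;
    my @parts = $file =~ /^(\w+)\.(?:gif|jpe?g)$/i;
    return @parts;
}

my @tagged  = tagged_parts("back.gif");
my @grouped = grouped_parts("back.gif");
print scalar(@tagged), " tagged value(s) with ( )\n";
print scalar(@grouped), " tagged value(s) with (?: )\n";
```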