Regular Expressions
1 Regular Expressions
The period (.) represents the wildcard character. Any character
(except for the newline character) will be matched by a period in a
regular expression; when you want to match a literal period in a regular
expression, you need to precede it with a backslash.
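A quick check in R illustrates the difference (a minimal sketch using grepl, which returns TRUE or FALSE for each string):

```r
# '.' matches any single character, so 'a.c' matches 'abc' and 'a.c' alike
grepl('a.c', c('abc', 'a.c', 'ac'))
# [1]  TRUE  TRUE FALSE
# an escaped period matches only a literal period
grepl('a\\.c', c('abc', 'a.c', 'ac'))
# [1] FALSE  TRUE FALSE
```

Note that inside an R character string the backslash itself must be doubled, which is why the escaped period appears as \\. in the pattern.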
Many times you'll need to express the idea of the beginning or end of a line
or word
in a regular expression. For example, you may be looking for a line number
at the beginning of a line, or only be interested in searching for a
string if it's at the end of a line. The caret (^) represents
the beginning of a line in a regular expression, and the dollar sign
($) represents the end of a line. To specify that the regular
expression you're looking for is a word (i.e. it's surrounded by white
space, punctuation or line beginnings and endings), surround the regular expression
with escaped angle brackets (\< and \>).
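These anchors are easy to verify in R (a minimal sketch; again the backslashes are doubled inside the R strings):

```r
grepl('^The', c('The end', 'In The end'))           # TRUE FALSE: '^' anchors to the start
grepl('end$', c('The end', 'end of story'))         # TRUE FALSE: '$' anchors to the end
grepl('\\<cat\\>', c('the cat sat', 'concatenate')) # TRUE FALSE: whole words only
```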
One of the most powerful features in regular expressions is being able to
specify that you're interested in patterns which may be variable.
In the previous example regarding extracting links through the href=
argument, we'd like to be able to find a reference no matter how many spaces
there are between the href and the =. We also might want
to account for the fact that quotes around the url are sometimes omitted.
There are a few characters that act as modifiers to the part of the regular
expression that precedes them, summarized in the table below.
Modifier   Meaning
--------   ---------------------------
*          zero or more
?          zero or one
+          one or more
{n}        exactly n occurrences
{n,}       at least n occurrences
{n,m}      between n and m occurrences
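Returning to the href example, the ? and * modifiers make it possible to sketch a pattern that tolerates extra spaces around the = and missing quotes (the exact character class used for the URL portion here is an assumption, not the only possibility):

```r
tags = c('<a href="page.html">', "<a href = 'page.html'>", '<a href = page.html>')
# ' *' allows any number of spaces; [\'"]? allows an optional opening quote;
# [^\'" >]+ then takes the URL characters up to a quote, space, or '>'
pat = 'href *= *[\'"]?[^\'" >]+'
g = regexpr(pat, tags)
substring(tags, g, g + attr(g, 'match.length') - 1)
```

All three forms match, with or without quotes and with any spacing around the equal sign.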
The first three modifiers in the above table are the most important ones;
the others are used much less often. Let's return to the idea of looking
for email addresses. Here's a regular expression that will match most
email addresses: [-A-Za-z0-9_.%]+@[-A-Za-z0-9_.%]+\.[A-Za-z]+ .
In words we could read the regular expression as "one or more occurrences
of the characters between the brackets, literally followed by an @-sign,
followed by one or more characters between the brackets, literally followed by
a period, and completed by one or more letters from among A-Z and a-z." You
can simplify things by specifying ignore.case=TRUE to most R
functions that deal with regular expressions; in that case you would only
have to put either A-Z or a-z in the character class, and
it would still match both upper or lower case letters.
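For example (a small sketch of the ignore.case argument; the address used is made up):

```r
pat = '[a-z0-9._%-]+@[a-z0-9._%-]+\\.[a-z]+'
grep(pat, 'Contact ME@SOMEWHERE.EDU')                      # integer(0): no match
grep(pat, 'Contact ME@SOMEWHERE.EDU', ignore.case = TRUE)  # [1] 1
```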
If we use the regular expression in a call to grep, it will find all of
the elements in a character vector that contain an email address:
> chk = c('abc noboby@stat.berkeley.edu','text with no email',
+ 'first me@mything.com also you@yourspace.com')
> grep('[-A-Za-z0-9_.%]+@[-A-Za-z0-9_.%]+\\.[A-Za-z]+',chk)
[1] 1 3
Since regular expressions in R are simply character strings,
we can save typing by storing regular expressions in variables. For
example, if we say:
> emailpat = '[-A-Za-z0-9_.%]+@[-A-Za-z0-9_.%]+\\.[A-Za-z]+'
then we can use the R variable emailpat in place of the
full regular expression. (If you use this technique, be sure to modify your
stored variable when you change your regular expression.)
To actually extract the matched text, we can use the
gregexpr function, which provides more information about regular
expression matches. First, let's see what the output from gregexpr
looks like:
> gregout = gregexpr(emailpat,chk)
> gregout
[[1]]
[1] 5
attr(,"match.length")
[1] 24
[[2]]
[1] -1
attr(,"match.length")
[1] -1
[[3]]
[1] 7 27
attr(,"match.length")
[1] 14 17
First, notice that, since a different number of matches may be
found in different strings, gregexpr returns
a list. Each list element is a vector of the starting positions at which
matches were found in the corresponding input string. In addition,
there is additional information stored as an attribute, which is part of
the value, but which doesn't interfere if we try to treat the value
as if it was simply a vector. The match.length attribute is another
vector, of the same length as the vector of starting points, telling us
how long each match was. Concentrating on the first element, we can use
the substring function to extract the actual address as follows:
> substring(chk[1],gregout[[1]],gregout[[1]] + attr(gregout[[1]],'match.length') - 1)
[1] "noboby@stat.berkeley.edu"
To make it a little easier to use, let's make a function that
will do the extraction for us:
getexpr = function(s,g)substring(s,g,g + attr(g,'match.length') - 1)
Now it's a little easier to get what we're looking for:
> getexpr(chk[2],gregout[[2]])
[1] ""
> getexpr(chk[3],gregout[[3]])
[1] "me@mything.com" "you@yourspace.com"
To use the same idea on an entire vector of character strings, we
could either write a loop, or use the mapply function. The
mapply function will repeatedly call a function of your choice,
cycling through the elements in as many vectors as you provide to the
function. To use our getexpr function with mapply to
extract all of the email addresses in the chk vector, we could
write:
> emails = mapply(getexpr,chk,gregout)
> emails
$"abc noboby@stat.berkeley.edu"
[1] "noboby@stat.berkeley.edu"
$"text with no email"
[1] ""
$"first me@mything.com also you@yourspace.com"
[1] "me@mything.com" "you@yourspace.com"
Notice that mapply uses the text of the original
character strings as names for the list it returns; this may or may
not be useful. To remove the names, use the assignment form of the
names function to set the names to NULL:
> names(emails) = NULL
> emails
[[1]]
[1] "noboby@stat.berkeley.edu"
[[2]]
[1] ""
[[3]]
[1] "me@mything.com" "you@yourspace.com"
The value that mapply returns is a list, the same length
as the vector of input strings (chk in this example), with an
empty string where there were no matches. If all you wanted was a vector
of all the email addresses, you could use the unlist function:
> unlist(emails)
[1] "noboby@stat.berkeley.edu" ""
[3] "me@mything.com" "you@yourspace.com"
The empty strings can be removed in the usual way:
emails = emails[emails != '']
or
emails = subset(emails,emails != '')
Suppose we wanted to know how many emails there were in each line of the
input text (chk). One idea that might make sense is to find the
length of each element of the list that the getexpr function
returned:
> emails = mapply(getexpr,chk,gregout)
> names(emails) = NULL
> sapply(emails,length)
[1] 1 1 2
The problem is that, in order to maintain the structure of
the output list, mapply put an empty (zero-character) string in the
second position of the list, so that length sees at least one
string in each element of the list. The solution is to write a function
that modifies the length function so that it only returns the
length if there are some characters in the strings for a particular list
element. (We can safely do this since we've already seen that there will
always be at least one element in the list.) We can use the if
statement to do this:
> sapply(emails,function(e)if(nchar(e[1]) > 0)length(e) else 0)
[1] 1 0 2
2 How matches are found
Regular expressions are matched by starting at the beginning of a string and
seeing if a possible match might begin there. If not, the next character in
the string is examined, and so on; if the end of the string is reached, then
no match is reported.
Let's consider the case where there is a potential match. The regular
expression program remembers where the beginning of the match was and starts
checking the characters to the right of that location. As long as the
expression continues to be matched, it will continue, adding more characters
to the matched pattern until it reaches a point in the string where the
regular expression is no longer matched. At that point, it backs up until
things match again, and it checks to see if the entire regular expression
has been matched. If it has, it reports a match; otherwise it reports
no match.
While the specifics of this mechanism will rarely concern you when you're
doing regular expression matches, there is one important point that you
should be aware of. The regular expression program is always going to try
to find the longest match possible. This means that if you use the wildcard
character, ., with the "zero or more" modifier, *, you may
get more than you expected.
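A minimal sketch of this longest-match behavior, using the same substring-based extraction as the getexpr function defined earlier:

```r
getexpr = function(s,g) substring(s, g, g + attr(g,'match.length') - 1)
g = regexpr('a.*b', 'axbyb')
getexpr('axbyb', g)
# [1] "axbyb"  -- the whole string, not the shorter "axb"
```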
Suppose we wish to remove HTML markup from a web page, in order to extract
the information that's on the page. All HTML markup begins with a
left angle bracket (<) and ends with a right angle bracket (>),
and has markup information in between.
Before diving in and trying to remove the markup, let's make sure we can
find it correctly.
An obvious (but ill-advised) choice for the regular expression is
<.*>. In words it means "a literal left angle bracket, followed by
zero or more occurrences of any character, followed by a literal right angle bracket".
Let's see how effective it is:
> htmls = c('This is an image: <IMG SRC="img/rxb.gif">',
+ '<a href="somepage.html">click here</a>')
> grep('<.*>',htmls)
[1] 1 2
We've matched both pieces, but what have we really got? We can apply our
mapply-based solution to see the actual strings that were matched:
> getexpr = function(s,g)substring(s,g,g + attr(g,'match.length') - 1)
> matches = mapply(getexpr,htmls,gregexpr('<.*>',htmls))
> names(matches) = NULL
> matches
[1] "<IMG SRC=\"img/rxb.gif\">"
[2] "<a href=\"somepage.html\">click here</a>"
The first match worked fine, but the second match went right past
the end of the first piece of markup, and included the second piece in it.
For this reason, regular expressions are said to be greedy - they will always
try to find the longest pattern that satisfies the match. Fixing problems
like this is pretty easy - we just have to think about what we really want to
match. In this case, we don't want zero or more occurrences of anything; what
we want is zero or more occurrences of anything except a right angle
bracket. As soon as we see the first right angle bracket after the left
angle bracket we want to stop. Fortunately, it's very easy to express ideas
like this with negated character classes. In this case, we simply replace
the period with [^>]:
> matches = mapply(getexpr,htmls,gregexpr('<[^>]*>',htmls))
> names(matches) = NULL
> matches
[[1]]
[1] "<IMG SRC=\"img/rxb.gif\">"
[[2]]
[1] "<a href=\"somepage.html\">" "</a>"
The two pieces of markup in the second element of htmls are now
correctly found.
Another way to solve the problem of greedy regular expressions is to use
a feature of regular expressions invented by Larry Wall, the creator of Perl.
His idea was to use the question mark (?) after an asterisk (*)
or plus sign (+) to indicate that you want a non-greedy match,
that is, to search for the smallest string which will match the pattern.
Recent versions of R have incorporated these features into their regular
expression support. So
an alternative to the solution shown above is:
> matches = mapply(getexpr,htmls,gregexpr('<.*?>',htmls))
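We can check that the non-greedy pattern recovers the same matches as the negated-character-class version (a sketch, repeating the getexpr definition so it is self-contained):

```r
getexpr = function(s,g) substring(s, g, g + attr(g,'match.length') - 1)
htmls = c('This is an image: <IMG SRC="img/rxb.gif">',
          '<a href="somepage.html">click here</a>')
greedyfix = mapply(getexpr, htmls, gregexpr('<[^>]*>', htmls))
lazy      = mapply(getexpr, htmls, gregexpr('<.*?>',  htmls))
identical(unname(greedyfix), unname(lazy))
# [1] TRUE
```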
File translated from TeX by TTH, version 3.67, on 25 Feb 2011, 08:58.