Regular Expressions
1 Regular Expressions
The period (.) represents the wildcard character. Any character
(except for the newline character) will be matched by a period in a
regular expression; when you want to match a literal period in a regular
expression, you need to precede it with a backslash.
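A quick check in R illustrates the difference (a minimal sketch using grepl, which returns TRUE or FALSE for each string):

```r
# '.' matches any single character, so 'a.c' matches 'abc' and 'a.c' alike
grepl('a.c', c('abc', 'a.c', 'ac'))
# [1]  TRUE  TRUE FALSE
# an escaped period matches only a literal period
grepl('a\\.c', c('abc', 'a.c', 'ac'))
# [1] FALSE  TRUE FALSE
```

Note that inside an R character string the backslash itself must be doubled, which is why the escaped period appears as \\. in the pattern.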
Many times you'll need to express the idea of the beginning or end of a line
or word
in a regular expression. For example, you may be looking for a line number
at the beginning of a line, or only be interested in searching for a
string if it's at the end of a line. The caret (^) represents
the beginning of a line in a regular expression, and the dollar sign
($) represents the end of a line. To specify that the regular
expression you're looking for is a word (i.e. it's surrounded by white
space, punctuation or line beginnings and endings), surround the regular expression
with escaped angle brackets (\< and \>).
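These anchors are easy to verify in R (a minimal sketch; again the backslashes are doubled inside the R strings):

```r
grepl('^The', c('The end', 'In The end'))           # TRUE FALSE: '^' anchors to the start
grepl('end$', c('The end', 'end of story'))         # TRUE FALSE: '$' anchors to the end
grepl('\\<cat\\>', c('the cat sat', 'concatenate')) # TRUE FALSE: whole words only
```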
One of the most powerful features in regular expressions is being able to
specify that you're interested in patterns which may be variable.
In the previous example regarding extracting links through the href=
argument, we'd like to be able to find a reference no matter how many spaces
there are between the href and the =. We also might want
to account for the fact that quotes around the url are sometimes omitted.
There are a few characters that act as modifiers to the part of the regular
expression that precedes them, summarized in the table below.
Modifier   Meaning
--------   ---------------------------
*          zero or more
?          zero or one
+          one or more
{n}        exactly n occurrences
{n,}       at least n occurrences
{n,m}      between n and m occurrences
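Returning to the href example, the ? and * modifiers make it possible to sketch a pattern that tolerates extra spaces around the = and missing quotes (the exact character class used for the URL portion here is an assumption, not the only possibility):

```r
tags = c('<a href="page.html">', "<a href = 'page.html'>", '<a href = page.html>')
# ' *' allows any number of spaces; [\'"]? allows an optional opening quote;
# [^\'" >]+ then takes the URL characters up to a quote, space, or '>'
pat = 'href *= *[\'"]?[^\'" >]+'
g = regexpr(pat, tags)
substring(tags, g, g + attr(g, 'match.length') - 1)
```

All three forms match, with or without quotes and with any spacing around the equal sign.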
The first three modifiers in the above table are the most important ones;
the others are used much less often. Let's return to the idea of looking
for email addresses. Here's a regular expression that will match most
email addresses: [-A-Za-z0-9_.%]+@[-A-Za-z0-9_.%]+\.[A-Za-z]+ .
In words we could read the regular expression as "one or more occurrences
of the characters between the brackets, literally followed by an @-sign,
followed by one or more characters between the brackets, literally followed by
a period, and completed by one or more letters from among A-Z and a-z." You
can simplify things by specifying ignore.case=TRUE to most R
functions that deal with regular expressions; in that case you would only
have to put either A-Z or a-z in the character class, and
it would still match both upper or lower case letters.
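For example (a small sketch of the ignore.case argument; the address used is made up):

```r
pat = '[a-z0-9._%-]+@[a-z0-9._%-]+\\.[a-z]+'
grep(pat, 'Contact ME@SOMEWHERE.EDU')                      # integer(0): no match
grep(pat, 'Contact ME@SOMEWHERE.EDU', ignore.case = TRUE)  # [1] 1
```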
If we use the regular expression in a call to grep, it will find all of
the elements in a character vector that contain an email address:
> chk = c('abc noboby@stat.berkeley.edu','text with no email',
+ 'first me@mything.com also you@yourspace.com')
> grep('[-A-Za-z0-9_.%]+@[-A-Za-z0-9_.%]+\\.[A-Za-z]+',chk)
[1] 1 3
Since regular expressions in R are simply character strings,
we can save typing by storing regular expressions in variables. For
example, if we say:
> emailpat = '[-A-Za-z0-9_.%]+@[-A-Za-z0-9_.%]+\\.[A-Za-z]+'
then we can use the R variable emailpat in place of the
full regular expression. (If you use this technique, be sure to modify your
stored variable when you change your regular expression.)
To actually extract the matched text, we can use the
gregexpr function, which provides more information about regular
expression matches. First, let's see what the output from gregexpr
looks like:
> gregout = gregexpr(emailpat,chk)
> gregout
[[1]]
[1] 5
attr(,"match.length")
[1] 24
[[2]]
[1] -1
attr(,"match.length")
[1] -1
[[3]]
[1] 7 27
attr(,"match.length")
[1] 14 17
First, notice that, since a different number of matches may be
found in different strings, gregexpr returns
a list. Each list element is a vector of the starting positions at which
matches were found in the corresponding input string. In addition,
there is additional information stored as an attribute, which is part of
the value, but which doesn't interfere if we try to treat the value
as if it was simply a vector. The match.length attribute is another
vector, of the same length as the vector of starting points, telling us
how long each match was. Concentrating on the first element, we can use
the substring function to extract the actual address as follows:
> substring(chk[1],gregout[[1]],gregout[[1]] + attr(gregout[[1]],'match.length') - 1)
[1] "noboby@stat.berkeley.edu"
To make it a little easier to use, let's make a function that
will do the extraction for us:
getexpr = function(s,g)substring(s,g,g + attr(g,'match.length') - 1)
Now it's a little easier to get what we're looking for:
> getexpr(chk[2],gregout[[2]])
[1] ""
> getexpr(chk[3],gregout[[3]])
[1] "me@mything.com" "you@yourspace.com"
To use the same idea on an entire vector of character strings, we
could either write a loop, or use the mapply function. The
mapply function will repeatedly call a function of your choice,
cycling through the elements in as many vectors as you provide to the
function. To use our getexpr function with mapply to
extract all of the email addresses in the chk vector, we could
write:
> emails = mapply(getexpr,chk,gregout)
> emails
$"abc noboby@stat.berkeley.edu"
[1] "noboby@stat.berkeley.edu"
$"text with no email"
[1] ""
$"first me@mything.com also you@yourspace.com"
[1] "me@mything.com" "you@yourspace.com"
Notice that mapply uses the text of the original
character strings as names for the list it returns; this may or may
not be useful. To remove the names, use the assignment form of the
names function to set the names to NULL:
> names(emails) = NULL
> emails
[[1]]
[1] "noboby@stat.berkeley.edu"
[[2]]
[1] ""
[[3]]
[1] "me@mything.com" "you@yourspace.com"
The value that mapply returns is a list, the same length
as the vector of input strings (chk in this example), with an
empty string where there were no matches. If all you wanted was a vector
of all the email addresses, you could use the unlist function:
> unlist(emails)
[1] "noboby@stat.berkeley.edu" ""
[3] "me@mything.com" "you@yourspace.com"
The empty strings can be removed in the usual way:
emails = emails[emails != '']
or
emails = subset(emails,emails != '')
Suppose we wanted to know how many emails there were in each line of the
input text (chk). One idea that might make sense is to find the
length of each element of the list that the getexpr function
returned:
> emails = mapply(getexpr,chk,gregout)
> names(emails) = NULL
> sapply(emails,length)
[1] 1 1 2
The problem is that, in order to maintain the structure of
the output list, mapply put an empty (zero-character) string in the
second position of the list, so that length sees at least one
string in each element of the list. The solution is to write a function
that modifies the length function so that it only returns the
length if there are some characters in the strings for a particular list
element. (We can safely do this since we've already seen that there will
always be at least one element in the list.) We can use the if
statement to do this:
> sapply(emails,function(e)if(nchar(e[1]) > 0)length(e) else 0)
[1] 1 0 2
2 How matches are found
Regular expressions are matched by starting at the beginning of a string and
seeing if a possible match might begin there. If not, the next character in
the string is examined, and so on; if the end of the string is reached, then
no match is reported.
Let's consider the case where there is a potential match. The regular
expression program remembers where the beginning of the match was and starts
checking the characters to the right of that location. As long as the
expression continues to be matched, it will continue, adding more characters
to the matched pattern until it reaches a point in the string where the
regular expression is no longer matched. At that point, it backs up until
things match again, and it checks to see if the entire regular expression
has been matched. If it has, it reports a match; otherwise it reports
no match.
While the specifics of this mechanism will rarely concern you when you're
doing regular expression matches, there is one important point that you
should be aware of. The regular expression program is always going to try
to find the longest match possible. This means that if you use the wildcard
character, ., with the "zero or more" modifier, *, you may
get more than you expected.
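A minimal sketch of this longest-match behavior, using the same substring-based extraction as the getexpr function defined earlier:

```r
getexpr = function(s,g) substring(s, g, g + attr(g,'match.length') - 1)
g = regexpr('a.*b', 'axbyb')
getexpr('axbyb', g)
# [1] "axbyb"  -- the whole string, not the shorter "axb"
```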
Suppose we wish to remove HTML markup from a web page, in order to extract
the information that's on the page. All HTML markup begins with a
left angle bracket (<) and ends with a right angle bracket (>),
and has markup information in between.
Before diving in and trying to remove the markup, let's make sure we can
find it correctly.
An obvious (but ill-advised) choice for the regular expression is
<.*>. In words it means "a literal left angle bracket, followed by
zero or more occurrences of any character, followed by a literal right angle bracket".
Let's see how effective it is:
> htmls = c('This is an image: <IMG SRC="img/rxb.gif">',
+ '<a href="somepage.html">click here</a>')
> grep('<.*>',htmls)
[1] 1 2
We've matched both pieces, but what have we really got? We can apply our
mapply-based solution to see the actual strings that were matched:
> getexpr = function(s,g)substring(s,g,g + attr(g,'match.length') - 1)
> matches = mapply(getexpr,htmls,gregexpr('<.*>',htmls))
> names(matches) = NULL
> matches
[1] "<IMG SRC=\"img/rxb.gif\">"
[2] "<a href=\"somepage.html\">click here</a>"
The first match worked fine, but the second match went right past
the end of the first piece of markup, and included the second piece in it.
For this reason, regular expressions are said to be greedy - they will always
try to find the longest pattern that satisfies the match. Fixing problems
like this is pretty easy - we just have to think about what we really want to
match. In this case, we don't want zero or more occurrences of anything; what
we want is zero or more occurrences of anything except a right angle
bracket. As soon as we see the first right angle bracket after the left
angle bracket we want to stop. Fortunately, it's very easy to express ideas
like this with negated character classes. In this case, we simply replace
the period with [^>]:
> matches = mapply(getexpr,htmls,gregexpr('<[^>]*>',htmls))
> names(matches) = NULL
> matches
[[1]]
[1] "<IMG SRC=\"img/rxb.gif\">"
[[2]]
[1] "<a href=\"somepage.html\">" "</a>"
The two pieces of markup in the second element of htmls are now
correctly found.
Another way to solve the problem of greedy regular expressions is to use
a feature of regular expressions invented by Larry Wall, the creator of Perl.
His idea was to use the question mark (?) after an asterisk (*)
or plus sign (+) to indicate that you want a non-greedy match,
that is, to search for the smallest string which will match the pattern.
Recent versions of R have incorporated these features into their regular
expression support. So
an alternative to the solution shown above is:
> matches = mapply(getexpr,htmls,gregexpr('<.*?>',htmls))
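We can check that the non-greedy pattern recovers the same matches as the negated-character-class version (a sketch, repeating the getexpr definition so it is self-contained):

```r
getexpr = function(s,g) substring(s, g, g + attr(g,'match.length') - 1)
htmls = c('This is an image: <IMG SRC="img/rxb.gif">',
          '<a href="somepage.html">click here</a>')
greedyfix = mapply(getexpr, htmls, gregexpr('<[^>]*>', htmls))
lazy      = mapply(getexpr, htmls, gregexpr('<.*?>',  htmls))
identical(unname(greedyfix), unname(lazy))
# [1] TRUE
```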
File translated from TeX by TTH, version 3.67, on 25 Feb 2011, 08:58.