Regular Expressions

1  Tagging and Backreferences

Consider again the problem of looking for email addresses. The regular expression that we wrote is exactly what we want, because we don't care what's surrounding the email address. But in many cases, the only way we can find what we want is to specify the surroundings of what we're looking for. Suppose we wish to write a program that will find all of the links (URLs that can be reached by clicking some text on the page) of a web page. A line containing a link may look something like this:
<a href="http://www.stat.berkeley.edu">UC Berkeley Stat Dept Home Page</a><br />

Finding the links is very easy; but our goal here is to extract the links themselves. Notice that there's no regular expression that can match just the link; we need to use some information about the context in which it's found, and when we extract the matched expression there will be extra characters that we really don't want. To handle this problem, parentheses can be used to surround parts of a regular expression that we're really interested in, and tools exist to help us get those parts separated from the overall expression. In R, the only functions that can deal with these tagged expressions are sub and gsub, so to take advantage of them, you may have to first extract the regular expressions with the methods we've already seen, and then apply sub or gsub. To illustrate, let's compose a simple regular expression to find links. I don't need to worry about the case of the regular expression, because the grep, sub, gsub and gregexpr functions all support the ignore.case= argument. Notice that I've surrounded the part we want ([^"'>]+) with parentheses. This will allow me to refer to this tagged expression as \1 in a call to gsub. (Additional tagged expressions will be referred to as \2, \3, etc.) Using this pattern, we can first find all the chunks of text that have our links embedded in them, and then use gsub to change the entire piece to just the part we want:
> link = '<a href="http://www.stat.berkeley.edu">UC Berkeley Stat Dept Home Page</a><br />'
> gregout = gregexpr('href *= *["\']?([^"\'>]+)["\']? *>',link,ignore.case=TRUE)
> thematch = mapply(getexpr,link,gregout)
> answer = gsub('href *= *["\']?([^"\'>]+)["\']? *>','\\1',thematch,ignore.case=TRUE)
> names(answer) = NULL
> answer
[1] "http://www.stat.berkeley.edu"

2  Getting Text into R

Up until now, we've been working with text that's already been formatted to make it easy for R to read, and scan or read.table (and its associated wrapper functions) have been sufficient to take care of things. Now we want to treat our input as raw text; in other words, we don't want R to assume that the data is in any particular form. The main function for taking care of this in R is readLines. In its simplest form, you pass readLines the name of a file or URL that you want to read, and it returns a character vector with one element for each line, containing the contents of that line. An optional argument (n=) specifies the number of lines you want to read; by default it's set to -1, which means to read all available lines. readLines removes the newline at the end of each line, but otherwise returns the text exactly the way it was found in the file or URL.
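For instance, here's a small sketch (assuming a plain text file called notes.txt exists in the working directory; the file name is just for illustration) showing the effect of the n= argument:

> x = readLines('notes.txt')             # all lines; n defaults to -1
> first5 = readLines('notes.txt',n=5)    # just the first five lines
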
readLines also accepts connections, which are objects in R that represent an alternative form of input, such as a pipe or a zipped file. Take a look at the help file for connections for more information on this capability.
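For example, here's a sketch (assuming a gzip-compressed file named notes.txt.gz exists in the working directory) that reads the lines of a compressed file through a connection created with gzfile:

> con = gzfile('notes.txt.gz')   # connection to a gzipped file
> x = readLines(con)             # readLines reads from it just like a file
> close(con)                     # destroy the connection when we're done
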
For moderate-sized problems that aren't too complex, using readLines in its default mode (reading all the lines of input into a vector of character strings) will usually be the best way to solve your problem, because most of the functions you'll use with such data are vectorized, and can operate on every line at once. As a simple example, suppose we wanted to get the names of all the files containing notes for this class. A glance at the page http://www.stat.berkeley.edu/classes/s133/schedule.html indicates that all the online notes can be found on lines like this one:
<tr><td> Jan 23 </td><td> <a href="Unix.html">Introduction to UNIX</a></td></tr>

We can easily extract the names of the note files using the sub function (since there is only one link per line, we don't need to use gsub, although we could).
The first step is to create a vector of character strings that will represent the lines of the URL we are trying to read. We can simply pass the URL name to readLines:
> x = readLines('http://www.stat.berkeley.edu/classes/s133/schedule.html')

Next, we can write a regular expression that can find the links. Note that the pattern that we want (i.e. the name of the file referenced in the link) has been tagged with parentheses for later extraction. By using the caret (^) and dollar sign ($) we can describe our pattern as an entire line, so that when we substitute the tagged expression for the pattern we'll have just what we want. I'll also prepend the base URL to the file names so that they could, for example, be entered directly into a browser.
> baseurl = 'http://www.stat.berkeley.edu/classes/s133'
> linkpat = '^.*<td> *<a href=["\'](.*)["\']>.*$'
> x = readLines('http://www.stat.berkeley.edu/classes/s133/schedule.html')
> y = grep(linkpat,x,value=TRUE)
> paste(baseurl,sub(linkpat,'\\1',y),sep='/')
 [1] "http://www.stat.berkeley.edu/classes/s133/Intro.html"   
 [2] "http://www.stat.berkeley.edu/classes/s133/OS.html"      
 [3] "http://www.stat.berkeley.edu/classes/s133/Unix.html"    
 [4] "http://www.stat.berkeley.edu/classes/s133/R-1a.html"    
 [5] "http://www.stat.berkeley.edu/classes/s133/R-2a.html"    
 [6] "http://www.stat.berkeley.edu/classes/s133/R-3a.html"    
                      . . . 



