Regular Expressions
1 Tagging and Backreferences
Consider again the problem of looking for email addresses. The regular
expression that we wrote is exactly what we want, because we don't care
what's surrounding the email address. But in many cases, the only way
we can find what we want is to specify the surroundings of what we're
looking for. Suppose we wish to write a program that will find all of
the links (URLs that can be reached by clicking some text on the page)
of a web page. A line containing a link may look something like this:
<a href="http://www.stat.berkeley.edu">UC Berkeley Stat Dept Home Page</a><br />
Finding the links is very easy; but our goal here is to extract
the links themselves. Notice that there's no regular expression that can
match just the link; we need to use some information about the context in
which it's found, and when we extract the matched expression there will be
extra characters that we really don't want. To handle this problem,
parentheses can be used to surround parts of a regular expression that we're
really interested in, and tools exist to help us get those parts separated
from the overall expression. In R, the only functions that can deal with
these tagged expressions are sub and gsub, so to take
advantage of them, you may have to first extract the matched text
with the methods we've already seen, and then apply sub or
gsub. To illustrate, let's compose a simple regular expression
to find links:
href *= *["\']?([^"\'>]+)["\']? *>
I don't need to worry about the case of the regular expression, because the grep,
sub, gsub and gregexpr functions all
support the ignore.case= argument.
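To see what that argument buys us, here's a tiny illustration with two made-up link strings, one uppercase and one lowercase:

```r
x = c('<A HREF="a.html">', '<a href="b.html">')
grep('href', x)                       # matches only the lowercase version
grep('href', x, ignore.case = TRUE)   # matches both
```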
Notice that I've surrounded the part we want ([^"'>]+) in parentheses.
This will allow me to refer to this tagged expression as \1 in a
call to gsub. (Additional tagged expressions will be referred to
as \2, \3, etc.)
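Before returning to the link example, here's a tiny self-contained illustration of backreferences, reordering two tagged pieces of a string (the name is of course made up):

```r
# Swap "Last, First" into "First Last" using two tagged expressions
sub('(\\w+), (\\w+)', '\\2 \\1', 'Doe, John')   # yields "John Doe"
```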
Using this
pattern, we can first find all the chunks of text that have our links
embedded in them, and then use gsub to change the entire piece to
just the part we want:
> link = '<a href="http://www.stat.berkeley.edu">UC Berkeley Stat Dept Home Page</a><br />'
> gregout = gregexpr('href *= *["\']?([^"\'>]+)["\']? *>',link,ignore.case=TRUE)
> thematch = mapply(getexpr,link,gregout)
> answer = gsub('href *= *["\']?([^"\'>]+)["\']? *>','\\1',thematch,ignore.case=TRUE)
> names(answer) = NULL
> answer
[1] "http://www.stat.berkeley.edu"
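The getexpr function used with mapply above was introduced earlier in these notes; if you don't have it handy, a minimal version (written here as a sketch, based on the structure of gregexpr's output) could look like this:

```r
# Extract the matched substrings described by a gregexpr result:
# g holds the starting positions of the matches, and the lengths of
# the matches are stored in its match.length attribute
getexpr = function(s, g) substring(s, g, g + attr(g, 'match.length') - 1)

link = '<a href="http://www.stat.berkeley.edu">UC Berkeley Stat Dept Home Page</a><br />'
gregout = gregexpr('href *= *["\']?([^"\'>]+)["\']? *>', link, ignore.case = TRUE)
mapply(getexpr, link, gregout)
```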
2 Getting Text into R
Up until now, we've been working with text that's already been formatted to make
it easy for R to read, and scan or read.table (and its associated
wrapper functions) have been sufficient to take care of things. Now we want to
treat our input as raw text; in other words, we don't want R to assume that the data
is in any particular form. The main function for taking care of this in R is
readLines. In the simplest form, you pass readLines the name
of a URL or file that you want to read, and it returns a character vector with one
element for each line of the file or URL, containing the text of that line.
An optional argument, n, tells readLines how many lines you want
to read; by default it's set to -1, which means to read all available
lines.
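To see the effect of that argument without going to the network, we can write a small file and read it back (the temporary file and its contents are just for illustration):

```r
tmp = tempfile()
writeLines(c('first line', 'second line', 'third line'), tmp)
readLines(tmp, n = 2)   # just the first two lines
readLines(tmp)          # n = -1 by default: all three lines
```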
readLines removes the newline at the end of each line, but otherwise returns the text exactly the way it was found in the file or URL.
readLines also accepts connections, which are objects in R that represent
an alternative form of input, such as a pipe or a zipped file. Take a look at
the help file for connections for more information on this capability.
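As one example of a connection, a gzipped file can be read through gzfile without uncompressing it first; a small sketch using a temporary file:

```r
tmp = tempfile(fileext = '.gz')
con = gzfile(tmp, 'w')
writeLines(c('stored compressed', 'but read normally'), con)
close(con)
readLines(gzfile(tmp))   # readLines opens and closes the connection itself
```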
For moderate-sized problems that aren't too complex, using readLines in
its default mode (reading all the lines of input into a vector of character
strings) will usually be the best way to solve your problem, because most of the
functions you'll use with such data are vectorized, and can operate on every line
at once.
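To make the vectorization point concrete, here's a tiny made-up example: a single call to grep or sub operates on every line of the vector at once:

```r
lines = c('alpha 1', 'beta 2', 'alpha 3')
grep('^alpha', lines)     # indices of the lines that match
sub(' .*$', '', lines)    # first word of every line, in one call
```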
As a simple example,
suppose we wanted to get
the names of all the files containing notes for this class. A glance at the
page http://www.stat.berkeley.edu/~spector/s133/schedule.html
indicates that all the online notes can be found on lines like this one:
<tr><td> Jan 23 </td><td> <a href="Unix.html">Introduction to UNIX</a></td></tr>
We can easily extract the names of the note files using the
sub function (since there is only one link per line, we don't
need to use gsub, although we could).
The first step is to create a vector of character strings representing
the lines of the URL we are trying to read. We can simply pass the URL name
to readLines:
> x = readLines('http://www.stat.berkeley.edu/~spector/s133/schedule.html')
Next, we can write a regular expression that can find the links.
Note that the pattern that we want (i.e. the name of the file referenced
in the link) has been tagged with parentheses for later extraction.
By using
the caret (^) and dollar sign ($) we can describe our pattern
as an entire line; when we substitute the tagged expression for the
pattern, we'll have just what we want. I'll also prepend the base URL of the
files so that they could be, for example, entered into a browser.
> baseurl = 'http://www.stat.berkeley.edu/~spector/s133'
> linkpat = '^.*<td> *<a href=["\'](.*)["\']>.*$'
> x = readLines('http://www.stat.berkeley.edu/~spector/s133/schedule.html')
> y = grep(linkpat,x,value=TRUE)
> paste(baseurl,sub(linkpat,'\\1',y),sep='/')
[1] "http://www.stat.berkeley.edu/~spector/s133/Intro.html"
[2] "http://www.stat.berkeley.edu/~spector/s133/OS.html"
[3] "http://www.stat.berkeley.edu/~spector/s133/Unix.html"
[4] "http://www.stat.berkeley.edu/~spector/s133/R-1a.html"
[5] "http://www.stat.berkeley.edu/~spector/s133/R-2a.html"
[6] "http://www.stat.berkeley.edu/~spector/s133/R-3a.html"
. . .
File translated from TeX by TTH, version 3.67, on 23 Feb 2011, 08:49.