Character Manipulation and Regular Expressions
1 Working with Characters
As you probably noticed when looking at the above functions, they are very simple,
and, quite frankly, it's hard to see how they could really do anything complex on their
own. In fact, that's just the point of these functions - they can be combined together
to do just about anything you would want to do. As an example, consider the task of
capitalizing the first character of each word in a string. The toupper function
can change the case of all the characters in a string, but we'll need to do something
to separate out the characters so we can get the first one. If we call strsplit
with an empty string for the splitting character, we'll get back a vector of the individual
characters:
> str = 'sherlock holmes'
> letters = strsplit(str,'')
> letters
[[1]]
[1] "s" "h" "e" "r" "l" "o" "c" "k" " " "h" "o" "l" "m" "e" "s"
> theletters = letters[[1]]
Notice that strsplit always returns a list. This will be very
useful later, but for now we'll extract the first element before
we try to work with its output.
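To see why the list return value matters, here is a small sketch that passes more than one string to strsplit. The vector strs is made up for illustration; the result should contain one list element per input string.

```r
# strs is a hypothetical input vector, not taken from the text above
strs = c('ab', 'cde')
pieces = strsplit(strs, '')

length(pieces)    # one list element for each input string
pieces[[2]]       # the individual characters of the second string
```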
The places where we'll need to capitalize things are the first position in the vector
of letters, and any letter that comes after a blank. We can find those positions
very easily:
> wh = c(1,which(theletters == ' ') + 1)
> wh
[1] 1 10
We can change the case of the letters whose indexes are in wh,
then use paste to put the string back together.
> theletters[wh] = toupper(theletters[wh])
> paste(theletters,collapse='')
[1] "Sherlock Holmes"
Things have gotten complicated enough that we could probably stand to
write a function:
maketitle = function(txt){
    theletters = strsplit(txt,'')[[1]]
    wh = c(1,which(theletters == ' ') + 1)
    theletters[wh] = toupper(theletters[wh])
    paste(theletters,collapse='')
}
Of course, we should always test our functions:
> maketitle('some crazy title')
[1] "Some Crazy Title"
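One way to make such tests repeatable is to write them with stopifnot, which stays silent when every condition holds and stops with an error otherwise. The sketch below repeats the function so it runs on its own; the test strings are made up for illustration.

```r
# the function from the text, repeated so this block is self-contained
maketitle = function(txt){
    theletters = strsplit(txt,'')[[1]]
    wh = c(1,which(theletters == ' ') + 1)
    theletters[wh] = toupper(theletters[wh])
    paste(theletters,collapse='')
}

# hypothetical test inputs: a single word, several words,
# and a string that is already capitalized
stopifnot(maketitle('avatar') == 'Avatar')
stopifnot(maketitle('up in the air') == 'Up In The Air')
stopifnot(maketitle('Already Capitalized') == 'Already Capitalized')
```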
Now suppose we have a vector of strings:
> titls = c('sherlock holmes','avatar','book of eli','up in the air')
We can always hope that we'll get the right answer if we just use
our function:
> maketitle(titls)
[1] "Sherlock Holmes"
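Unfortunately, it didn't work in this case: the function extracts only the first
element of the list returned by strsplit, so only the first title was processed.
When a function handles just one string at a time, sapply can be used to apply it
to each of the elements in the vector: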
> sapply(titls,maketitle)
  sherlock holmes            avatar       book of eli     up in the air
"Sherlock Holmes"          "Avatar"     "Book Of Eli"   "Up In The Air"
Of course, this isn't the only way to solve the problem. Rather than break
up the string into individual letters, we can break it up into words, and
capitalize the first letter of each, then combine them back together. Let's
explore that approach:
> str = 'sherlock holmes'
> words = strsplit(str,' ')
> words
[[1]]
[1] "sherlock" "holmes"
Now we can use the assignment form of the substring
function to change the first letter of each word to a capital. Note that
the function we pass to sapply must actually return the modified string,
so we ensure that its last statement evaluates to
the string:
> sapply(words[[1]],function(w){substring(w,1,1) = toupper(substring(w,1,1));w})
  sherlock     holmes
"Sherlock"   "Holmes"
Now we can paste the pieces back together to get our answer:
> res = sapply(words[[1]],function(w){substring(w,1,1) = toupper(substring(w,1,1));w})
> paste(res,collapse=' ')
[1] "Sherlock Holmes"
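The assignment form of substring used above can be tried on a single string to see what it does on its own. This is only a sketch; the variable w is chosen for illustration.

```r
w = 'holmes'

# the assignment form replaces characters 1 through 1 of w in place
substring(w, 1, 1) = toupper(substring(w, 1, 1))
w
```

Because the change is made to w itself, a function built around this step has to end with w so that the modified string is what gets returned.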
To operate on a vector of strings, we'll need to incorporate
these steps into a function, and then call sapply:
mktitl = function(str){
    words = strsplit(str,' ')
    res = sapply(words[[1]],function(w){substring(w,1,1) = toupper(substring(w,1,1));w})
    paste(res,collapse=' ')
}
We can test the function, making sure to use a string different from
the one we used in our initial test:
> mktitl('some silly string')
[1] "Some Silly String"
And now we can test it on the vector of strings:
> titls = c('sherlock holmes','avatar','book of eli','up in the air')
> sapply(titls,mktitl)
  sherlock holmes            avatar       book of eli     up in the air
"Sherlock Holmes"          "Avatar"     "Book Of Eli"   "Up In The Air"
How can we compare the two methods? The R function system.time will
report the amount of time any operation in R uses. One important caveat -
if you wish to assign the value of an expression to a variable in the system.time
call, you must use the "<-" assignment operator, because the
equal sign will confuse the function into thinking you're specifying a
named parameter in the function call. Let's try system.time on
our two functions:
> system.time(one <- sapply(titls,maketitle))
user system elapsed
0.000 0.000 0.001
> system.time(two <- sapply(titls,mktitl))
user system elapsed
0.000 0.000 0.002
For such a tiny example, we can't really trust that the difference
we see is real. Let's use the movie names from a previous example:
> movies = read.delim('http://www.stat.berkeley.edu/classes/s133/data/movies.txt',
+ sep='|',stringsAsFactors=FALSE)
> nms = tolower(movies$name)
> system.time(one <- sapply(nms,maketitle))
user system elapsed
0.044 0.000 0.045
> system.time(two <- sapply(nms,mktitl))
user system elapsed
0.256 0.000 0.258
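If the movie file isn't available, a similar comparison can be sketched with a synthetic input. The vector big below is an assumption made for illustration (the four titles repeated 1000 times), not the data used above, and the timings will differ from machine to machine, so no output is shown.

```r
# the two functions from the text, repeated so this block runs on its own
maketitle = function(txt){
    theletters = strsplit(txt,'')[[1]]
    wh = c(1,which(theletters == ' ') + 1)
    theletters[wh] = toupper(theletters[wh])
    paste(theletters,collapse='')
}
mktitl = function(str){
    words = strsplit(str,' ')
    res = sapply(words[[1]],function(w){substring(w,1,1) = toupper(substring(w,1,1));w})
    paste(res,collapse=' ')
}

titls = c('sherlock holmes','avatar','book of eli','up in the air')
big = rep(titls, 1000)    # assumption: synthetic stand-in for the movie names

# note the use of <- inside the system.time calls
t1 = system.time(one <- sapply(big, maketitle))
t2 = system.time(two <- sapply(big, mktitl))

t1['elapsed']             # elapsed seconds for the letter-based method
t2['elapsed']             # elapsed seconds for the word-based method
```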
It looks like the first method is considerably faster than the second. Of
course, if they don't get the same answer, it doesn't really matter how
fast they are. In R, the all.equal function can be used to
see if things are the same:
> all.equal(one,two)
[1] TRUE
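One detail worth knowing: when its arguments differ, all.equal does not return FALSE, but a character description of the differences. The sketch below, using two made-up vectors a and b, shows the usual idiom of wrapping the call in isTRUE when a plain TRUE or FALSE is needed.

```r
# a and b are hypothetical values that differ in one letter
a = c(x = 'Sherlock Holmes')
b = c(x = 'Sherlock holmes')

res = all.equal(a, b)
res              # a description of the mismatch, not FALSE
isTRUE(res)      # FALSE, safe to use in an if statement
identical(a, a)  # TRUE; identical always returns a single logical value
```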
2 Regular Expressions
Regular expressions are a method of describing patterns in text that's
far more flexible than using ordinary character strings. While an
ordinary text string is only matched by an exact copy of the string,
regular expressions give us the ability to describe what we want in more
general terms. For example, while we couldn't search for email addresses
in a text file using normal searches (unless we knew every possible email
address), we can use regular expressions to describe the general form of an
email address (some characters followed by an "@" sign, followed by some more
characters, a period, and a few more characters),
and then find all the email addresses in the document very easily.
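As a rough sketch of that idea, the pattern below spells out "some characters, an @ sign, more characters, a period, and a few more characters". It is a deliberately simple assumption about what an email address looks like, not a complete definition, and the text in txt is made up. It uses pieces not yet covered here (the + modifier, and the functions gregexpr and regmatches).

```r
# hypothetical text to search
txt = c('contact alice@example.com or bob.smith@stat.example.edu',
        'no address here')

# characters, then @, then more characters, a period, and some letters
pat = '[[:alnum:]._-]+@[[:alnum:].-]+\\.[[:alpha:]]+'

# gregexpr locates every match; regmatches extracts the matched text
found = unlist(regmatches(txt, gregexpr(pat, txt)))
found
```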
Another handy feature of regular expressions is that we can "tag" parts of
an expression for extraction. If you look at the HTML source of a web
page (for example, by using View->Source in a browser, or using
download.file in R to make a local copy), you'll notice that all
the clickable links are represented by HTML like:
<a href="http://someurl.com/somewhere">
It would be easy to search for the string href= to find the links,
but what if some webmasters used something like
<a href = 'http://someurl.com/somewhere'>
Now a search for href= won't help us, but it's easy to
express those sorts of choices using regular expressions.
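Here is a minimal sketch of how both forms of the link could be handled with a single pattern. Parentheses "tag" the part we want, and \\1 in the replacement refers back to it. The pattern is an illustration built for these two example lines only, not a general HTML parser.

```r
links = c('<a href="http://someurl.com/somewhere">',
          "<a href = 'http://someurl.com/somewhere'>")

# optional blanks around the equal sign, either kind of quote,
# and a tagged group holding everything up to the closing quote
pat = "^.*href *= *[\"']([^\"']*)[\"'].*$"

urls = sub(pat, '\\1', links)
urls
```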
There are a lot of different versions of regular expressions in the world of
computers, and while they share the same basic concepts and much of the
same syntax, there are irritating differences among the different versions.
If you're looking for additional information about regular expressions in
books or on the web, you should know that, in addition to basic regular
expressions, recent versions of R also support Perl-style regular expressions.
(Perl is a scripting language whose
creator, Larry Wall, developed some attractive extensions to the basic
regular expression syntax.) Some of the rules of regular expressions are
laid out in very terse language on the R help page for regex and
regexpr. Since regular
expressions are a somewhat challenging topic, there are many valuable resources
on the internet.
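As a small taste of the Perl-style extensions, the sketch below revisits the capitalization problem from the first section. It relies on \U in the replacement, which the R help page for gsub describes as available only when perl=TRUE is used; \b (a word boundary) and \w (a word character) are part of the pattern syntax.

```r
titls = c('sherlock holmes','up in the air')

# tag the first character of each word, and replace it
# with its upper case version
caps = gsub('\\b(\\w)', '\\U\\1', titls, perl = TRUE)
caps
```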
Before we start, one word of caution. We'll see that the way that regular
expressions work is that they take many of the common punctuation symbols and
give them special meanings. Because of this, when you want to refer to one
of these symbols literally (that is, as simply a character like other characters),
you need to precede those symbols with a backslash (\). But backslashes
already have a special meaning in character strings; they are used to indicate
control characters, like tab (\t), and newline (\n). The upshot
of this is that when you want a backslash in a regular expression, to keep
a symbol from being treated as special, you need to type two backslashes in the
input. By the way, the characters for which this needs to be done are:
. ^ $ + ? * ( ) [ ] { } | \
To reiterate, when any of these characters is to be interpreted literally in
a regular expression, it must be preceded by two backslashes when it is
passed to functions in R. If you're not sure, it's almost always safe to
add a backslash (by typing two backslashes) in front of a punctuation
character - if it's not needed, it will be ignored. This does not hold for
letters and digits, since a backslash followed by a letter or digit (such as \d)
can have a special meaning of its own.
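The difference the backslashes make can be seen in a small sketch like the following, where the string x is made up for illustration:

```r
x = 'a.b.c'

unescaped = gsub('.', '', x)     # . is special: it matches every character
escaped = gsub('\\.', '', x)     # \\. matches only a literal period

unescaped
escaped

# we typed two backslashes, but the string R stores holds just one,
# so the pattern is two characters long: a backslash and a period
nchar('\\.')
```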
Regular expressions are constructed from three different components:
literal characters, which will only be matched by the identical literal
character; character classes, which can be matched by more than one character;
and modifiers, which operate on characters, character classes, or combinations
of the two. A character class consists of an opening square bracket ([),
one or more characters, and a closing square bracket (]), and is
matched by any of the characters between the brackets. If the first character
inside the brackets is the caret (^), then the character class is
matched by anything except the other characters between the brackets.
Ranges are allowed in character classes and one of the most important examples
is [0-9], a character class matched by any of the digits from 0 to 9.
Other ranges of characters include [a-z], all the lower case letters,
and [A-Z], all the uppercase letters.
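A few character classes in action, as a sketch; the strings in x are made up, and grepl and gsub are used simply as convenient ways to try out a pattern. The ^ and $ in the last pattern tie the match to the beginning and end of the string, and + means "one or more".

```r
x = c('room 101', 'Room B', 'lobby')

has_digit = grepl('[0-9]', x)         # does the string contain a digit?
digits_only = gsub('[^0-9]', '', x)   # remove everything except digits
all_lower = grepl('^[a-z]+$', x)      # only lower case letters, start to end

has_digit
digits_only
all_lower
```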
There are also some shortcuts for certain character classes that you may
or may not find useful. They're summarized in the following table.
Symbol | Matches             | Symbol | Matches
\w     | Alphanumerics and _ | \W     | Anything but alphanumerics and _
\d     | Digits              | \D     | Non-digits
\s     | Whitespace          | \S     | Non-whitespace
Like other cases in R where backslashes need to be interpreted literally,
we have to include two backslashes to use the above shortcuts.
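For instance, a sketch using each of the shortcuts with two backslashes might look like this (the string x is made up, and contains two blanks before the last word):

```r
x = 'call 555 1234  now'

masked = gsub('\\d', '#', x)          # replace every digit with #
words = strsplit(x, '\\s+')[[1]]      # split on runs of whitespace
squeezed = gsub('\\W', '', x)         # drop everything \w would not match

masked
words
squeezed
```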
File translated from TeX by TTH, version 3.67, on 9 Feb 2011, 15:40.