Character Manipulation and Regular Expressions

1 Working with Characters

As you probably noticed when looking at the above functions, they are very simple, and, quite frankly, it's hard to see how they could really do anything complex on their own. In fact, that's just the point of these functions - they can be combined together to do just about anything you would want to do. As an example, consider the task of capitalizing the first character of each word in a string. The toupper function can change the case of all the characters in a string, but we'll need to do something to separate out the characters so we can get the first one. If we call strsplit with an empty string for the splitting character, we'll get back a vector of the individual characters:

> str = 'sherlock holmes'
> letters = strsplit(str,'')
> letters
[[1]]
 [1] "s" "h" "e" "r" "l" "o" "c" "k" " " "h" "o" "l" "m" "e" "s"
> theletters = letters[[1]]

Notice that strsplit always returns a list. This will be very useful later, but for now we'll extract the first element before we try to work with its output.

The positions we need to capitalize are the first element of the vector of letters and any letter that follows a blank. We can find those positions very easily:

> wh = c(1,which(theletters == ' ') + 1)
> wh
[1]  1 10

We can change the case of the letters whose indexes are in wh, then use paste to put the string back together.

> theletters[wh] = toupper(theletters[wh])
> paste(theletters,collapse='')
[1] "Sherlock Holmes"

Things have gotten complicated enough that we could probably stand to write a function:

maketitle = function(txt){
  theletters = strsplit(txt,'')[[1]]
  wh = c(1,which(theletters  == ' ') + 1)
  theletters[wh] = toupper(theletters[wh])
  paste(theletters,collapse='')
}

Of course, we should always test our functions:

> maketitle('some crazy title')
[1] "Some Crazy Title"

Now suppose we have a vector of strings:

> titls = c('sherlock holmes','avatar','book of eli','up in the air')

We can always hope that we'll get the right answer if we just use our function:

> maketitle(titls)
[1] "Sherlock Holmes"

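Unfortunately, it didn't work in this case: because the function extracts only the first element of the list returned by strsplit, it processes only the first string. When a function doesn't handle vectors on its own like this, we can use sapply to apply it to each element of the vector: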

> sapply(titls,maketitle)
  sherlock holmes            avatar       book of eli     up in the air 
"Sherlock Holmes"          "Avatar"     "Book Of Eli"   "Up In The Air"

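As a small aside, the names attached to the sapply result come from the input strings. The sketch below (assuming the maketitle function and titls vector defined above, repeated here so the block runs on its own) shows two standard ways to get an unnamed result; the variable names plain and checked are just for illustration.

```r
maketitle = function(txt){
  theletters = strsplit(txt,'')[[1]]
  wh = c(1,which(theletters == ' ') + 1)
  theletters[wh] = toupper(theletters[wh])
  paste(theletters,collapse='')
}
titls = c('sherlock holmes','avatar','book of eli','up in the air')

# USE.NAMES = FALSE drops the names that sapply copies from the input strings
plain = sapply(titls, maketitle, USE.NAMES = FALSE)

# vapply also checks that each call returns exactly one character string
checked = vapply(titls, maketitle, character(1), USE.NAMES = FALSE)

print(plain)
```

Either form returns an ordinary character vector of the same length as the input.
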
Of course, this isn't the only way to solve the problem. Rather than break up the string into individual letters, we can break it up into words, and capitalize the first letter of each, then combine them back together. Let's explore that approach:

> str = 'sherlock holmes'
> words = strsplit(str,' ')
> words
[[1]]
[1] "sherlock" "holmes"

Now we can use the assignment form of the substring function to change the first letter of each word to a capital. Note that the function we pass to sapply has to actually return the modified string, so we ensure that the last statement in that function is the string itself:

> sapply(words[[1]],function(w){substring(w,1,1) = toupper(substring(w,1,1));w})
  sherlock     holmes 
"Sherlock"   "Holmes"

Now we can paste the pieces back together to get our answer:

> res = sapply(words[[1]],function(w){substring(w,1,1) = toupper(substring(w,1,1));w})
> paste(res,collapse=' ')
[1] "Sherlock Holmes"

To operate on a vector of strings, we'll need to incorporate these steps into a function, and then call sapply:

mktitl = function(str){
   words = strsplit(str,' ')
   res = sapply(words[[1]],function(w){substring(w,1,1) = toupper(substring(w,1,1));w})
   paste(res,collapse=' ')
}

We can test the function, making sure to use a string different from the one we used in our initial test:

> mktitl('some silly string')
[1] "Some Silly String"

And now we can test it on the vector of strings:

> titls = c('sherlock holmes','avatar','book of eli','up in the air')
> sapply(titls,mktitl)
  sherlock holmes            avatar       book of eli     up in the air 
"Sherlock Holmes"          "Avatar"     "Book Of Eli"   "Up In The Air"

How can we compare the two methods? The R function system.time will report the amount of time any operation in R uses. One important caveat: if you wish to assign the value of an expression to a variable inside the system.time call, you must use the "<-" assignment operator, because an equal sign would make R think you're specifying a named argument in the function call. Let's try system.time on our two functions:

> system.time(one <- sapply(titls,maketitle))
   user  system elapsed 
  0.000   0.000   0.001 
> system.time(two <- sapply(titls,mktitl))
   user  system elapsed 
  0.000   0.000   0.002

For such a tiny example, we can't really trust that the difference we see is real. Let's use the movie names from a previous example:

> movies = read.delim('http://www.stat.berkeley.edu/classes/s133/data/movies.txt',
+ sep='|',stringsAsFactors=FALSE)
> nms = tolower(movies$name)
> system.time(one <- sapply(nms,maketitle))
   user  system elapsed 
  0.044   0.000   0.045 
> system.time(two <- sapply(nms,mktitl))
   user  system elapsed 
  0.256   0.000   0.258

It looks like the first method is faster than the second, by a factor of roughly six in user time (0.044 versus 0.256 seconds). Of course, if they don't get the same answer, it doesn't really matter how fast they are. In R, the all.equal function can be used to see if things are the same:

> all.equal(one,two)
[1] TRUE
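
One detail worth knowing when all.equal is used inside a program: when its arguments differ, it returns a character description of the differences rather than FALSE. A minimal sketch, using two made-up vectors a and b rather than the movie data:

```r
a = c('Sherlock Holmes', 'Avatar')
b = c('Sherlock Holmes', 'avatar')

# all.equal returns TRUE when the objects match ...
same = isTRUE(all.equal(a, a))

# ... but a character description (not FALSE) when they differ,
# so wrap it in isTRUE before using it as a condition
differ = isTRUE(all.equal(a, b))

# identical always returns a single TRUE or FALSE
exact = identical(a, b)

print(c(same, differ, exact))
```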

2 Regular Expressions

Regular expressions are a method of describing patterns in text that's far more flexible than using ordinary character strings. While an ordinary text string is only matched by an exact copy of the string, regular expressions give us the ability to describe what we want in more general terms. For example, while we couldn't search for email addresses in a text file using normal searches (unless we knew every possible email address), we can describe the general form of an email address (some characters followed by an "@" sign, followed by some more characters, a period, and a few more characters) through regular expressions, and then find all the email addresses in the document very easily.
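
To make the email example concrete, here is a rough sketch using grepl and regmatches. The sample strings and the pattern are illustrative assumptions only; the pattern is deliberately simple and will not recognize every valid address. In it, a plus sign means "one or more of the preceding item", and the square brackets and doubled backslash are explained later in this section.

```r
txt = c('contact: jane@example.com',
        'no address here',
        'write to bob.smith@stat.example.edu today')

# some characters, an @ sign, more characters, a period, a few letters
pat = '[a-z0-9._]+@[a-z0-9.]+\\.[a-z]+'

# which strings contain something that looks like an address?
has_email = grepl(pat, txt)

# extract the first match from each string that has one
found = regmatches(txt, regexpr(pat, txt))

print(has_email)
print(found)
```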

Another handy feature of regular expressions is that we can "tag" parts of an expression for extraction. If you look at the HTML source of a web page (for example, by using View->Source in a browser, or using download.file in R to make a local copy), you'll notice that all the clickable links are represented by HTML like:

<a href="http://someurl.com/somewhere">

It would be easy to search for the string href= to find the links, but what if some webmasters used something like

<a href  = 'http://someurl.com/somewhere'>

Now a search for href= won't help us, but it's easy to express those sorts of choices using regular expressions.
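
As a sketch of how such a pattern might look, the code below allows optional blanks around the equal sign and either kind of quote, and "tags" the URL with parentheses so that sub can pull it out as \1. The pattern is an assumption made for illustration; real HTML has more variations than it handles.

```r
links = c('<a href="http://someurl.com/somewhere">',
          "<a href  = 'http://someurl.com/somewhere'>")

# a plain search for href= only finds the first form
plain = grepl('href=', links, fixed = TRUE)

# href, optional blanks, =, optional blanks, a quote,
# the tagged URL (anything up to the next quote), and a closing quote
pat = ".*href *= *[\"']([^\"']*)[\"'].*"

# replace the whole string with just the tagged part
urls = sub(pat, '\\1', links)

print(plain)
print(urls)
```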

There are a lot of different versions of regular expressions in the world of computers, and while they share the same basic concepts and much of the same syntax, there are irritating differences among the different versions. If you're looking for additional information about regular expressions in books or on the web, you should know that, in addition to basic regular expressions, recent versions of R also support Perl-style regular expressions. (Perl is a scripting language whose creator, Larry Wall, developed some attractive extensions to the basic regular expression syntax.) Some of the rules of regular expressions are laid out in very terse language on the R help pages for regex and regexpr. Since regular expressions are a somewhat challenging topic, there are many valuable resources on the internet.

Before we start, one word of caution. We'll see that the way that regular expressions work is that they take many of the common punctuation symbols and give them special meanings. Because of this, when you want to refer to one of these symbols literally (that is, as simply a character like other characters), you need to precede those symbols with a backslash (\). But backslashes already have a special meaning in character strings; they are used to indicate control characters, like tab (\t), and newline (\n). The upshot of this is that when you want a backslash in a regular expression to keep R from misinterpreting certain symbols, you need to type two backslashes in the input. By the way, the characters for which this needs to be done are:

 
        . ^ $ + ? * ( ) [ ] { } | \

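To reiterate, when any of these characters is to be interpreted literally in a regular expression, it must be preceded by two backslashes when it is passed to functions in R. If you're not sure whether a punctuation character is special, it's almost always safe to add a backslash (by typing two backslashes) in front of it - if it's not needed, it will be ignored. Don't do this with letters or digits, though, since a backslash followed by a letter often has a special meaning of its own, as in the \w, \d and \s shortcuts in the table below.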
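
A short sketch of the difference, using a made-up vector x: an unescaped period matches any character, while the escaped form matches only a literal period. The fixed = TRUE argument shown at the end is a standard option of grepl that turns off regular expression interpretation altogether.

```r
x = c('3.14', '314', 'a+b')

# unescaped, the period matches any single character
any_char = grepl('.', x)

# typed as two backslashes, it matches only a literal period
literal_dot = grepl('\\.', x)

# the same goes for the plus sign
literal_plus = grepl('\\+', x)

# fixed = TRUE treats the pattern as an ordinary string, so no escaping is needed
fixed_plus = grepl('+', x, fixed = TRUE)

# the string typed as '\\.' really contains just two characters: \ and .
n = nchar('\\.')

print(rbind(any_char, literal_dot, literal_plus, fixed_plus))
```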

Regular expressions are constructed from three different components: literal characters, which will only be matched by the identical literal character; character classes, which are matched by any one of several characters; and modifiers, which operate on characters, character classes, or combinations of the two. A character class consists of an opening square bracket ([), one or more characters, and a closing square bracket (]), and is matched by any of the characters between the brackets. If the first character inside the brackets is the caret (^), then the character class is matched by anything except the other characters between the brackets. Ranges are allowed in character classes and one of the most important examples is [0-9], a character class matched by any of the digits from 0 to 9. Other ranges of characters include [a-z], all the lower case letters, and [A-Z], all the uppercase letters.
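
The following sketch tries a few character classes on a made-up vector x. Note that grepl reports whether the pattern matches anywhere in each string, so a negated class like [^0-9] is TRUE for any string containing at least one non-digit.

```r
x = c('abc', 'ABC', 'a1c', '123')

# strings containing at least one digit
has_digit = grepl('[0-9]', x)

# strings containing at least one character that is not a digit
has_nondigit = grepl('[^0-9]', x)

# a class can also simply list its members
has_vowel = grepl('[aeiou]', x)

print(rbind(has_digit, has_nondigit, has_vowel))
```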

There are also some shortcuts for certain character classes that you may or may not find useful. They're summarized in the following table.

Symbol  Matches               Symbol  Matches
\w      Alphanumerics and _   \W      Non-alphanumerics
\d      Digits                \D      Non-digits
\s      Whitespace            \S      Non-whitespace

Like other cases in R where backslashes need to be interpreted literally, we have to include two backslashes to use the above shortcuts.
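
For instance, with a made-up string x, the shortcuts can be used with gsub as sketched below; each is typed with two backslashes.

```r
x = 'call 555-1234 now'

# replace every digit with #
masked = gsub('\\d', '#', x)

# replace every whitespace character with an underscore
joined = gsub('\\s', '_', x)

# remove everything that is not alphanumeric or an underscore
stripped = gsub('\\W', '', x)

print(c(masked, joined, stripped))
```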

File translated from TeX by TtH, version 3.67, on 9 Feb 2011, 15:40.