Functions for Working with Characters

1 Sizes of Objects

Before we start looking at character manipulation, this is a good time to review the different functions that give us the size of an object.

length - returns the length of a vector, or the total number of elements in a matrix (number of rows times number of columns). For a data frame, returns the number of columns.
dim - for matrices and data frames, returns a vector of length 2 containing the number of rows and the number of columns. For a vector, returns NULL. The convenience functions nrow and ncol return the individual values that would be returned by dim.
nchar - for a character string, returns the number of characters in the string. Returns a vector of values when applied to a vector of character strings. For a numeric value, nchar returns the number of characters in the printed representation of the number.
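
To see how these functions differ, here is a small sketch that applies them to a vector, a matrix and a data frame. The objects v, m and dframe are made up for this illustration and are not part of the course data.

```r
# Illustrative objects; the names v, m and dframe are hypothetical
v = c(10, 20, 30)
m = matrix(1:6, nrow = 2, ncol = 3)
dframe = data.frame(a = 1:4, b = letters[1:4])

length(v)        # 3
length(m)        # 6 (2 rows times 3 columns)
length(dframe)   # 2 (the number of columns)

dim(v)           # NULL, since a plain vector has no dimensions
dim(m)           # 2 3
dim(dframe)      # 4 2
nrow(dframe)     # 4
ncol(dframe)     # 2

nchar(c('cat', 'elephant'))   # 3 8
nchar(12345)                  # 5, from the printed form "12345"
```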

2 Character Manipulation

While it's quite natural to think of data as being numbers, manipulating character strings is also an important skill when working with data. We've already seen a few simple examples, such as choosing the right format for a character variable that represents a date, or using table to tabulate the occurrences of different character values for a variable. Now we're going to look at some functions in R that let us break apart, rearrange and put together character data.

One of the most important uses of character manipulation is "massaging" data into shape. Many times the data that is available to us, for example on a web page or as output from another program, isn't in a form that a program like R can easily interpret. In cases like that, we'll need to remove the parts that R can't understand, and organize the remaining parts so that R can read them efficiently.

Let's take a look at some of the functions that R offers for working with character variables:

paste The paste function converts its arguments to character before operating on them, so you can pass both numbers and strings to the function. It concatenates the arguments passed to it, to create new strings that are combinations of other strings. paste accepts an unlimited number of unnamed arguments, which will be pasted together, and one or both of the arguments sep= and collapse=. Depending on whether the arguments are scalars or vectors, and which of sep= and collapse= are used, a variety of different tasks can be performed.
1. If you pass a single argument to paste, it will return a character representation:
```
> paste('cat')
[1] "cat"
> paste(14)
[1] "14"
```
2. If you pass more than one scalar argument to paste, it will put them together in a single string, using the sep= argument (a single space by default) to separate the pieces:
```
> paste('stat',133,'assignment')
[1] "stat 133 assignment"
```
3. If you pass a vector of character values to paste, and the collapse= argument is not NULL, it pastes together the elements of the vector, using the collapse= argument as a separator:
```
> paste(c('stat',133,'assignment'),collapse=' ')
[1] "stat 133 assignment"
```
4. If you pass more than one argument to paste, and any of those arguments is a vector, paste will return a vector as long as its longest argument, produced by pasting together corresponding pieces of the arguments. (Remember the recycling rule, which will be used if the vector arguments are of different lengths.) Here are a few examples:
```
> paste('x',1:10,sep='')
 [1] "x1"  "x2"  "x3"  "x4"  "x5"  "x6"  "x7"  "x8"  "x9"  "x10"
> paste(c('x','y'),1:10,sep='')
 [1] "x1"  "y2"  "x3"  "y4"  "x5"  "y6"  "x7"  "y8"  "x9"  "y10"
```
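The sep= and collapse= arguments can also be used in the same call. As a brief sketch (the variable names joined and pairs are just for illustration): sep= is placed between the arguments when the individual pieces are built, and collapse= is then placed between those pieces to produce a single string.

```r
# Build "x1", "x2", "x3" using sep=, then join them with collapse=
joined = paste('x', 1:3, sep = '', collapse = ' + ')
joined
# [1] "x1 + x2 + x3"

# sep= goes between arguments; collapse= goes between elements of the result
pairs = paste(c('a', 'b'), c('c', 'd'), sep = '-', collapse = '/')
pairs
# [1] "a-c/b-d"
```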
grep The grep function searches for patterns in text. The first argument to grep is a text string or regular expression that you're looking for, and the second argument is usually a vector of character values. grep returns the indices of those elements of the vector of character strings that contain the text string. Right now we'll limit ourselves to simple patterns, but later we'll explore the full strength of commands like this with regular expressions.
grep can be used in a number of ways. Suppose we want to see the countries of the world that have the word 'United' in their names.
```
> grep('United',world1$country) 
[1] 144 145
```
grep returns the indices of the observations that have 'United' in their names. If we want to see the values of country that have 'United' in their names, we can use the value=TRUE argument:
```
> grep('United',world1$country,value=TRUE)
[1] "United Arab Emirates" "United Kingdom"
```
Notice that, since the first form of grep returns a vector of indices, we can use it as a subscript to get all the information about the countries that have 'United' in their names:
```
> world1[grep('United',world1$country),]
                 country   gdp income literacy    military cont
144 United Arab Emirates 23200  23818     77.3  1600000000   AS
145       United Kingdom 27700  28938     99.9 42836500000   EU
```
grep has a few optional arguments, some of which we'll look at later. One convenient argument is ignore.case=TRUE, which, as the name implies, will look for the pattern we specified without regard to case.
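Here is a minimal sketch of ignore.case=TRUE. Since the world1 data frame isn't reproduced here, it uses a small made-up vector called countries with inconsistent capitalization. It also shows that grep returns a vector of length zero when nothing matches.

```r
# A hypothetical vector of names with inconsistent capitalization
countries = c('United Arab Emirates', 'Tunisia', 'united kingdom', 'Uruguay')

exact = grep('United', countries)                        # case-sensitive
exact
# [1] 1

anycase = grep('United', countries, ignore.case = TRUE)  # ignores case
anycase
# [1] 1 3

grep('united', countries, ignore.case = TRUE, value = TRUE)
# [1] "United Arab Emirates" "united kingdom"

none = grep('Zealand', countries)                        # no matches
length(none)
# [1] 0
```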
strsplit strsplit takes a character vector, and breaks each element up into pieces, based on the value of the split= argument. This argument can be an ordinary text string, or a regular expression. Since the different elements of the vector may have different numbers of "pieces", the results from strsplit are always returned in a list. Here's a simple example:
```
> mystrings = c('the cat in the hat','green eggs and ham','fox in socks')
> parts = strsplit(mystrings,' ')
> parts 
[[1]]
[1] "the" "cat" "in"  "the" "hat"

[[2]]
[1] "green" "eggs"  "and"   "ham"

[[3]]
[1] "fox"   "in"    "socks"
```
While we haven't dealt much with lists before, one function that can be very useful is sapply; you can use sapply to operate on each element of a list, and it will, if possible, return the result as a vector. So to find the number of words in each of the character strings in mystrings, we could use:
```
> sapply(parts,length)
[1] 5 4 3
```
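Two other common things to do with the result of strsplit are sketched below: pulling out one piece from each element, and flattening the whole list into a single vector with unlist. The names firstwords, allwords and counts are just illustrative.

```r
mystrings = c('the cat in the hat','green eggs and ham','fox in socks')
parts = strsplit(mystrings, ' ')

# Apply a small function to each list element to get its first word
firstwords = sapply(parts, function(p) p[1])
firstwords
# [1] "the"   "green" "fox"

# unlist combines all the pieces into one character vector
allwords = unlist(parts)
length(allwords)
# [1] 12

# Tabulate how often each word occurs across all the strings
counts = table(allwords)
counts[['the']]
# [1] 2
```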
substring The substring function allows you to extract portions of a character string. Its first argument is a character string, or vector of character strings, and its second argument is the index (starting with 1) of the beginning of the desired substring. With no third argument, substring returns the string starting at the specified index and continuing to the end of the string; if a third argument is given, it represents the last index of the original string that will be included in the returned substring. Like many functions in R, its true value is that it is fully vectorized: you can extract substrings of a vector of character values in a single call. Here's an example of a simple use of substring:
```
> strings = c('elephant','aardvark','chicken','dog','duck','frog')
> substring(strings,1,5)
[1] "eleph" "aardv" "chick" "dog"   "duck"  "frog"
```
Notice that, when a string is too short to fully meet a substringing request, no error or warning is raised, and substring returns as much of the string as is there.
Consider the following example, extracted from a web page. Each element of the character vector data consists of a town name and a state name followed by five numbers. Extracting an individual field, say the field with the state names, is straightforward:
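As a further sketch, here is substring used without a third argument, and with a vector of starting positions. The names tails and last3 are just for illustration.

```r
strings = c('elephant','aardvark','chicken','dog','duck','frog')

# With no third argument, we get everything from position 4 to the end;
# 'dog' has only three characters, so its result is an empty string
tails = substring(strings, 4)
tails
# [1] "phant" "dvark" "cken"  ""      "k"     "g"

# The starting position can itself be a vector, one value per string;
# here it is used to extract the last three characters of each string
last3 = substring(strings, nchar(strings) - 2)
last3
# [1] "ant" "ark" "ken" "dog" "uck" "rog"
```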
```
> data = c("Lyndhurst      Ohio          199.02  15,074  30  5   25",
           "Southport Town New York      217.69  11,025  24  4   20",
           "Bedford        Massachusetts 221.20  12,658  28  0   28")
> states = substring(data,16,28)
> states
[1] "Ohio         " "New York     " "Massachusetts"
```
It is possible to extract all the fields at once, at the cost of a considerably more complex call to substring:
```
> starts = c(1,16,30,38,46,50,54)
> ends   = c(14,28,35,43,47,50,55)
> ldata = length(data)
> lstarts = length(starts)
> x = substring(data,rep(starts,rep(ldata,lstarts)),rep(ends,rep(ldata,lstarts)))
> matrix(x,ncol=lstarts)
     [,1]             [,2]            [,3]     [,4]     [,5] [,6] [,7]
[1,] "Lyndhurst     " "Ohio         " "199.02" "15,074" "30" "5"  "25"
[2,] "Southport Town" "New York     " "217.69" "11,025" "24" "4"  "20"
[3,] "Bedford       " "Massachusetts" "221.20" "12,658" "28" "0"  "28"
```
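The call works because substring recycles data across all 21 start and end positions, and each start position is repeated once for every string. As a sketch of an equivalent, perhaps easier to read, way of writing the same thing, the each= argument of rep can build those repeated positions directly (the name fields is just illustrative):

```r
data = c("Lyndhurst      Ohio          199.02  15,074  30  5   25",
         "Southport Town New York      217.69  11,025  24  4   20",
         "Bedford        Massachusetts 221.20  12,658  28  0   28")
starts = c(1,16,30,38,46,50,54)
ends   = c(14,28,35,43,47,50,55)
ldata = length(data)

# Each start position is repeated once per string: 1 1 1 16 16 16 30 30 30 ...
rep(starts, each = ldata)

# data is recycled, so the result holds field 1 of every string, then field 2, ...
x = substring(data, rep(starts, each = ldata), rep(ends, each = ldata))

# matrix fills column by column, so each column holds one field
fields = matrix(x, ncol = length(starts))

# A column without commas can be converted to numbers directly
as.numeric(fields[, 3])
# [1] 199.02 217.69 221.20
```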
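Like many functions in R, substring can appear on the left hand side of an assignment statement, making it easy to change parts of a character string based on the positions they're in. Only as many characters as the replacement string contains are overwritten, and the strings never change length. For example, assigning '99' to positions 3 through 5 of a set of character strings representing numbers overwrites the third and fourth characters of each string (or as many of them as the string has):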
```
> nums = c('12553','73911','842099','203','10')
> substring(nums,3,5) = '99'
> nums
[1] "12993"  "73991"  "849999" "209"    "10"
```
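To overwrite all of positions 3 through 5, the replacement needs to be three characters long. A small sketch, using the same starting values and a hypothetical helper variable before to confirm that the lengths of the strings don't change:

```r
nums = c('12553','73911','842099','203','10')
before = nchar(nums)

# A three-character replacement fills positions 3, 4 and 5 where they exist
substring(nums, 3, 5) = '999'
nums
# [1] "12999"  "73999"  "849999" "209"    "10"

# Replacement never lengthens or shortens the strings
identical(nchar(nums), before)
# [1] TRUE
```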
tolower, toupper These functions convert their arguments to all lower-case characters or all upper-case characters, respectively.
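As a quick sketch, using a made-up vector x of inconsistently capitalized names, these functions are handy for putting values into a common case before comparing or tabulating them:

```r
# A hypothetical vector with mixed capitalization
x = c('Ohio', 'NEW YORK', 'massachusetts')

lower = tolower(x)
lower
# [1] "ohio"          "new york"      "massachusetts"

upper = toupper(x)
upper
# [1] "OHIO"          "NEW YORK"      "MASSACHUSETTS"

# Values that differ only in case compare equal after conversion
tolower('Ohio') == tolower('OHIO')
# [1] TRUE
```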
sub, gsub These functions replace text that matches a regular expression or text pattern with a different set of characters. They differ in that sub only changes the first occurrence of the specified pattern in each string, while gsub changes all of the occurrences. Since numeric values in R cannot contain dollar signs or commas, one important use of gsub is to create numeric variables from text variables that represent numbers but contain commas or dollar signs. For example, in gathering the data for the world dataset that we've been using, I extracted the information about military spending from http://en.wikipedia.org/wiki/List_of_countries_by_military_expenditures. Here's an excerpt of some of the values from that page:
```
> values = c('370,700,000,000','205,326,700,000','67,490,000,000')
> as.numeric(values)
[1] NA NA NA
Warning message:
NAs introduced by coercion
```
The commas prevent R from converting the values into actual numbers. gsub easily solves the problem:
```
> as.numeric(gsub(',','',values))
[1] 370700000000 205326700000  67490000000
```
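To see the difference between the two functions, here is a short sketch applying sub to the same values. It also handles dollar signs, using a made-up vector called prices. Because the dollar sign has a special meaning in regular expressions, the sketch passes fixed=TRUE so that the patterns are treated as ordinary text.

```r
values = c('370,700,000,000','205,326,700,000','67,490,000,000')

# sub removes only the first comma in each string
first_only = sub(',', '', values)
first_only
# [1] "370700,000,000" "205326,700,000" "67490,000,000"

# Hypothetical values containing both dollar signs and commas
prices = c('$1,250.50', '$75.00')

# fixed=TRUE treats '$' and ',' as plain characters, not as regular expressions
cleaned = gsub(',', '', gsub('$', '', prices, fixed = TRUE), fixed = TRUE)
amounts = as.numeric(cleaned)
amounts
# [1] 1250.5   75.0
```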

File translated from TeX by TTH, version 3.67.
On 7 Feb 2011, 15:33.