substring
The substring function allows you to extract portions of a character
string. Its first argument is a character string, or vector of character strings,
and its second argument is the index (starting with 1) of the beginning of the
desired substring. With no third argument, substring returns the string
starting at the specified index and continuing to the end of the string; if a
third argument is given, it represents the last index of the original string
that will be included in the returned substring. Like many functions in R, its
true value is that it is fully vectorized: you can extract substrings of a vector
of character values in a single call. Here's an example of a simple use of
substring:
> strings = c('elephant','aardvark','chicken','dog','duck','frog')
> substring(strings,1,5)
[1] "eleph" "aardv" "chick" "dog" "duck" "frog"
Notice that, when a string is too short to fully meet a substring
request, no error or warning is raised, and substring returns as much
of the string as is there.
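The two-argument form and the vectorization over the index arguments described above can be sketched as follows. This is only a minimal illustration that reuses the strings vector from the example; the names tails and last3 are introduced here and are not part of the original notes.

```r
strings = c('elephant','aardvark','chicken','dog','duck','frog')

# With no third argument, everything from position 4 to the end is returned;
# strings shorter than 4 characters yield an empty string.
tails = substring(strings, 4)
print(tails)   # "phant" "dvark" "cken" "" "k" "g"

# The start index can itself be a vector: here, the last three characters
# of each string, with the start position computed from nchar().
last3 = substring(strings, nchar(strings) - 2)
print(last3)   # "ant" "ark" "ken" "dog" "uck" "rog"
```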
Consider the following example, extracted from a web page. Each element of
the character vector data consists of a place name and a state name followed by
five numbers, laid out in fixed-width columns. Extracting an individual field,
say the field with the state names, is straightforward:
> data = c("Lyndhurst      Ohio          199.02  15,074  30  5   25",
           "Southport Town New York      217.69  11,025  24  4   20",
           "Bedford        Massachusetts 221.20  12,658  28  0   28")
> states = substring(data,16,28)
> states
[1] "Ohio         " "New York     " "Massachusetts"
It is possible to extract all the fields at once, at the cost of
a considerably more complex call to substring:
> starts = c(1,16,30,38,46,50,54)
> ends = c(14,28,35,43,47,50,55)
> ldata = length(data)
> lstarts = length(starts)
> x = substring(data,rep(starts,rep(ldata,lstarts)),rep(ends,rep(ldata,lstarts)))
> matrix(x,ncol=lstarts)
     [,1]             [,2]            [,3]     [,4]     [,5] [,6] [,7]
[1,] "Lyndhurst     " "Ohio         " "199.02" "15,074" "30" "5"  "25"
[2,] "Southport Town" "New York     " "217.69" "11,025" "24" "4"  "20"
[3,] "Bedford       " "Massachusetts" "221.20" "12,658" "28" "0"  "28"
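The call above works because rep repeats each start and end position once per element of data, so substring receives every combination of string and field. A somewhat more readable sketch of the same idea uses the each= argument of rep and strips the blank padding with trimws. The data vector is rebuilt here with sprintf under the assumption that the fields sit at the column positions given by starts and ends; the names fmt and fields are illustrative only.

```r
# Rebuild the fixed-width records; the layout is assumed from starts/ends.
fmt = "%-14s %-13s %6s  %6s  %2s  %1s   %2s"
data = sprintf(fmt,
               c("Lyndhurst", "Southport Town", "Bedford"),
               c("Ohio", "New York", "Massachusetts"),
               c("199.02", "217.69", "221.20"),
               c("15,074", "11,025", "12,658"),
               c("30", "24", "28"),
               c("5", "4", "0"),
               c("25", "20", "28"))

starts = c(1,16,30,38,46,50,54)
ends = c(14,28,35,43,47,50,55)
ldata = length(data)

# rep(starts, each = ldata) is equivalent to rep(starts, rep(ldata, lstarts))
x = substring(data, rep(starts, each = ldata), rep(ends, each = ldata))

# trimws() removes the blank padding before the vector is reshaped
fields = matrix(trimws(x), ncol = length(starts))
print(fields)
```

Once the padding is gone, a column such as fields[,2] can be compared directly against state names without worrying about trailing blanks.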
Like many functions in R, substring can appear on the left-hand side
of an assignment statement, making it easy to change parts of a character string
based on the positions they're in. The assignment never changes the length of a
string, and it overwrites only as many characters as the replacement value
supplies. To overwrite the digits starting at the third position of
a set of character strings representing numbers with 99, we could use:
> nums = c('12553','73911','842099','203','10')
> substring(nums,3,5) = '99'
> nums
[1] "12993" "73991" "849999" "209" "10"
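How much of each string is overwritten depends on both the length of the replacement and the length of the string itself. The following sketch reuses the values above and is meant only to illustrate that behavior; the variable names two and three are hypothetical.

```r
nums = c('12553','73911','842099','203','10')

# A two-character replacement overwrites only positions 3 and 4,
# even though the requested range is 3 through 5.
two = nums
substring(two, 3, 5) = '99'
print(two)    # "12993" "73991" "849999" "209" "10"

# A three-character replacement fills the whole range where the string
# is long enough; strings are never lengthened.
three = nums
substring(three, 3, 5) = '999'
print(three)  # "12999" "73999" "849999" "209" "10"
```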
sub, gsub
These functions replace text matching a regular expression or text pattern with
a different set of characters. They differ in that sub only changes the first
occurrence of the specified pattern, while gsub changes all of the
occurrences. Since numeric values in R cannot contain dollar signs or commas,
one important use of gsub is to create numeric variables from text
variables that represent numbers but contain commas or dollar signs. For example,
in gathering the data for the world dataset that we've been using, I extracted
the information about military spending from
http://en.wikipedia.org/wiki/List_of_countries_by_military_expenditures. Here's an
excerpt of some of the values from that page:
> values = c('370,700,000,000','205,326,700,000','67,490,000,000')
> as.numeric(values)
[1] NA NA NA
Warning message:
NAs introduced by coercion
The commas prevent R from converting the values into actual numbers.
gsub easily solves the problem:
> as.numeric(gsub(',','',values))
[1] 370700000000 205326700000 67490000000
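To see the difference between sub and gsub on the same values, and to handle dollar signs as well as commas, a sketch along these lines may help. The prices vector is made up for illustration and is not part of the data above; the names first_only, cleaned, and plain are likewise hypothetical.

```r
values = c('370,700,000,000','205,326,700,000','67,490,000,000')

# sub() removes only the first comma, so the result is still not numeric
first_only = sub(',', '', values)
print(first_only)   # "370700,000,000" "205326,700,000" "67490,000,000"

# A character class removes dollar signs and commas in one pass
prices = c('$1,250.00', '$15.99')   # hypothetical values
cleaned = as.numeric(gsub('[$,]', '', prices))
print(cleaned)

# fixed = TRUE treats the pattern as literal text rather than a regular expression
plain = as.numeric(gsub(',', '', values, fixed = TRUE))
print(plain)
```

Inside the square brackets the dollar sign is taken literally; outside a character class it would need to be escaped or matched with fixed = TRUE, since it otherwise anchors the end of the string.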