Matrices and Arrays


Let's review matrices

mat <- matrix(rnorm(12), nrow = 3, ncol = 4)
mat
##         [,1]    [,2]    [,3]    [,4]
## [1,]  0.5872 -1.1782  1.3072 -0.2861
## [2,]  0.9946 -0.1400 -0.6624  1.2640
## [3,] -1.4555  0.5503  0.4209 -0.3674

# vectorized calcs work with matrices too
mat * 4
##        [,1]    [,2]   [,3]   [,4]
## [1,]  2.349 -4.7126  5.229 -1.144
## [2,]  3.978 -0.5599 -2.650  5.056
## [3,] -5.822  2.2012  1.684 -1.470
mat <- cbind(mat, 1:3)
mat
##         [,1]    [,2]    [,3]    [,4] [,5]
## [1,]  0.5872 -1.1782  1.3072 -0.2861    1
## [2,]  0.9946 -0.1400 -0.6624  1.2640    2
## [3,] -1.4555  0.5503  0.4209 -0.3674    3

Arrays are like matrices but can have more or fewer than two dimensions.

arr <- array(rnorm(12), c(2, 3, 4))
arr
## , , 1
## 
##        [,1]    [,2]     [,3]
## [1,] -1.280  0.1234  0.06407
## [2,] -0.175 -1.4361 -1.48029
## 
## , , 2
## 
##        [,1]    [,2]     [,3]
## [1,] 0.2296  0.1599  0.05827
## [2,] 0.3107 -1.8740 -0.76463
## 
## , , 3
## 
##        [,1]    [,2]     [,3]
## [1,] -1.280  0.1234  0.06407
## [2,] -0.175 -1.4361 -1.48029
## 
## , , 4
## 
##        [,1]    [,2]     [,3]
## [1,] 0.2296  0.1599  0.05827
## [2,] 0.3107 -1.8740 -0.76463
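
Note that rnorm(12) above supplies only 12 values for 24 cells, so R recycles them; that is why slices 3 and 4 repeat slices 1 and 2. Here is a small sketch of how array dimensions and indexing behave. The objects arr3 and arr1 are made up for illustration.

```r
# A 2 x 3 x 4 array filled with 1:24 (24 values, so no recycling)
arr3 <- array(1:24, dim = c(2, 3, 4))
dim(arr3)              # 2 3 4
arr3[2, 3, 4]          # a single element: the last one, 24
slice <- arr3[, , 1]   # fixing the third index leaves a 2 x 3 matrix
is.matrix(slice)
# Fewer than two dimensions is allowed too
arr1 <- array(1:5, dim = 5)
length(dim(arr1))      # 1
```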

Attributes

Objects have attributes.

attributes(mat)
## $dim
## [1] 3 5
rownames(mat) <- c("first", "middle", "last")
mat
##           [,1]    [,2]    [,3]    [,4] [,5]
## first   0.5872 -1.1782  1.3072 -0.2861    1
## middle  0.9946 -0.1400 -0.6624  1.2640    2
## last   -1.4555  0.5503  0.4209 -0.3674    3
attributes(mat)
## $dim
## [1] 3 5
## 
## $dimnames
## $dimnames[[1]]
## [1] "first"  "middle" "last"  
## 
## $dimnames[[2]]
## NULL

mat[4]
## [1] -1.178
attributes(mat) <- NULL
mat
##  [1]  0.5872  0.9946 -1.4555 -1.1782 -0.1400  0.5503  1.3072 -0.6624
##  [9]  0.4209 -0.2861  1.2640 -0.3674  1.0000  2.0000  3.0000
is.matrix(mat)
## [1] FALSE

What can you infer about what a matrix is in R?

What kind of object are the attributes themselves? How do I check?
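
One way to explore both questions, as a sketch (the object v is hypothetical and not part of the examples above):

```r
v <- 1:6
is.matrix(v)          # FALSE: a plain vector
dim(v) <- c(2, 3)     # adding a dim attribute...
is.matrix(v)          # ...is all it takes to get a matrix
# attributes() hands the attributes back as a list
att <- attributes(v)
class(att)
is.list(att)
```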

Matrices are stored column-major

This is like Fortran but not like C.

mat <- matrix(1:12, 3, 4)
mat
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
c(mat)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12

You can go smoothly back and forth between a matrix (or an array) and a vector:

identical(mat, matrix(c(mat), 3, 4))
## [1] TRUE
identical(mat, matrix(c(mat), 3, 4, byrow = TRUE))
## [1] FALSE

Mixing up column-wise and row-wise filling (note the byrow = TRUE result above) is a common cause of bugs!
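
To make the storage order concrete, here is a minimal sketch of how a (row, column) pair maps to a position in the underlying vector. The names m, i, j, lin and byrow_m are invented for this example.

```r
m <- matrix(1:12, nrow = 3, ncol = 4)
i <- 2; j <- 3
# Column-major: element (i, j) sits at position i + (j - 1) * nrow(m)
lin <- i + (j - 1) * nrow(m)
m[i, j] == m[lin]
# Filling by row is the same as filling the transposed shape by column
byrow_m <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
identical(byrow_m, t(matrix(1:12, nrow = 4, ncol = 3)))
```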

Missing values and other special values

Since it was designed by statisticians, R handles missing values very well relative to other languages.

vec <- rnorm(12)
vec[c(3, 5)] <- NA
vec
##  [1]  0.6694  0.8623      NA  0.4507      NA  0.4999 -1.3394 -1.6569
##  [9] -0.3677 -0.2878 -0.0341 -1.2096
length(vec)
## [1] 12
sum(vec)
## [1] NA
sum(vec, na.rm = TRUE)
## [1] -2.413
hist(vec)

[Figure: histogram of vec]

is.na(vec)
##  [1] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE

Be careful because many R functions won't warn you that they are ignoring the missing values.
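
Two cases where missing values disappear quietly, as a sketch with made-up data (x and d are hypothetical; the lm() behavior assumes the na.action option is still at its default of na.omit):

```r
x <- c(1, 2, NA, 2, NA)
table(x)                     # the NAs are dropped without any message
table(x, useNA = "ifany")    # ask for them explicitly
# lm() drops incomplete rows when na.action is left at na.omit
d <- data.frame(y = c(1.0, 2.1, NA, 3.9, 5.2), z = c(1, 2, 3, NA, 5))
fit <- lm(y ~ z, data = d)
nobs(fit)                    # only 3 of the 5 rows were used
```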

big <- Inf
big
## [1] Inf
big + 7
## [1] Inf
sqrt(-5)
## Warning: NaNs produced
## [1] NaN
big - big
## [1] NaN
1/0
## [1] Inf
vec <- c(vec, NULL)
vec
##  [1]  0.6694  0.8623      NA  0.4507      NA  0.4999 -1.3394 -1.6569
##  [9] -0.3677 -0.2878 -0.0341 -1.2096
length(vec)
## [1] 12
a <- NULL
a + 7
## numeric(0)
a[3, 4]
## NULL
is.null(a)
## [1] TRUE
myList <- list(a = 7, b = 5)
myList$a <- NULL  # works for data frames too
myList
## $b
## [1] 5

NA can hold a place but NULL cannot. NULL is useful for having a function argument default to 'nothing'. See help(crossprod), which can compute either \(X^{\top}X\) or \(X^{\top}Y\).
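
A minimal sketch of both points. The function my_crossprod() is a hypothetical stand-in that only imitates the y = NULL default of crossprod(); it is not how crossprod() is implemented.

```r
length(c(1, NA, 3))     # 3: NA holds its place
length(c(1, NULL, 3))   # 2: NULL simply vanishes

# Hypothetical helper with a NULL default for its second argument
my_crossprod <- function(x, y = NULL) {
  if (is.null(y)) {
    y <- x              # nothing supplied: fall back to x itself
  }
  t(x) %*% y
}
X <- matrix(1:6, nrow = 3)
Y <- matrix(1:3, nrow = 3)
my_crossprod(X)         # t(X) %*% X
my_crossprod(X, Y)      # t(X) %*% Y
```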

Logical vectors

answers <- c(TRUE, TRUE, FALSE, FALSE)
update <- c(TRUE, FALSE, TRUE, FALSE)

answers & update
## [1]  TRUE FALSE FALSE FALSE
answers | update
## [1]  TRUE  TRUE  TRUE FALSE
# note the vectorized boolean arithmetic

# what am I doing here?
sum(answers)
## [1] 2
mean(answers)
## [1] 0.5
answers + update
## [1] 2 1 1 0

How do you think R manages to do arithmetic on logical vectors?
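
A hint, as a small sketch (flags is a made-up vector): logicals are coerced to numbers before the arithmetic happens.

```r
as.numeric(c(TRUE, FALSE))   # TRUE becomes 1, FALSE becomes 0
flags <- c(TRUE, TRUE, FALSE, FALSE)
sum(flags)                   # the number of TRUEs
mean(flags)                  # the proportion of TRUEs
class(flags + flags)         # the result is integer, not logical
```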

Tricks with logicals…

identical(answers & update, as.logical(answers * update))
## [1] TRUE
identical(answers | update, as.logical(answers + update))
## [1] TRUE

Data frames

A review from Module 1…

require(foreign)
vote = read.dta("../data/2004_labeled_processed_race.dta")
class(vote)
## [1] "data.frame"
head(vote)
##   state pres04    sex  race  age9 partyid income relign8 age60 age65
## 1     2      1 female white 25-29    <NA>   <NA>    <NA> 18-29 25-29
## 2     2      2   male white 18-24    <NA>   <NA>    <NA> 18-29 18-24
## 3     2      1 female black 30-39    <NA>   <NA>    <NA> 30-44 30-39
## 4     2      1 female black 30-39    <NA>   <NA>    <NA> 30-44 30-39
## 5     2      1 female white 40-44    <NA>   <NA>    <NA> 30-44 40-49
## 6     2      1 female white 30-39    <NA>   <NA>    <NA> 30-44 30-39
##   geocode sizeplac brnagain attend year region y
## 1       3    rural     <NA>   <NA> 2004      4 0
## 2       3    rural     <NA>   <NA> 2004      4 1
## 3       3    rural     <NA>   <NA> 2004      4 0
## 4       3    rural     <NA>   <NA> 2004      4 0
## 5       3    rural     <NA>   <NA> 2004      4 0
## 6       3    rural     <NA>   <NA> 2004      4 0
str(vote)
## 'data.frame':    76205 obs. of  17 variables:
##  $ state   : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ pres04  : int  1 2 1 1 1 1 1 2 2 2 ...
##  $ sex     : Factor w/ 2 levels "male","female": 2 1 2 2 2 2 1 2 2 2 ...
##  $ race    : Factor w/ 5 levels "white","black",..: 1 1 2 2 1 1 1 1 1 1 ...
##  $ age9    : Factor w/ 9 levels "18-24","25-29",..: 2 1 3 3 4 3 4 1 2 1 ...
##  $ partyid : Factor w/ 4 levels "democrat","republican",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ income  : Factor w/ 8 levels "under $15,000",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ relign8 : Factor w/ 8 levels "protestant","catholic",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ age60   : Factor w/ 4 levels "18-29","30-44",..: 1 1 2 2 2 2 2 1 1 1 ...
##  $ age65   : Factor w/ 6 levels "18-24","25-29",..: 2 1 3 3 4 3 4 1 2 1 ...
##  $ geocode : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ sizeplac: Factor w/ 5 levels "city over 500,000",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ brnagain: Factor w/ 2 levels "yes","no": NA NA NA NA NA NA NA NA NA NA ...
##  $ attend  : Factor w/ 6 levels "more than once a week",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ year    : num  2004 2004 2004 2004 2004 ...
##  $ region  : num  4 4 4 4 4 4 4 4 4 4 ...
##  $ y       : num  0 1 0 0 0 0 0 1 1 1 ...
##  - attr(*, "datalabel")= chr ""
##  - attr(*, "time.stamp")= chr " 6 Jun 2007 14:53"
##  - attr(*, "formats")= chr  "%8.0g" "%8.0g" "%8.0g" "%8.0g" ...
##  - attr(*, "types")= int  251 251 251 251 251 251 251 251 251 251 ...
##  - attr(*, "val.labels")= chr  "stanum" "presak04" "sex" "race" ...
##  - attr(*, "var.labels")= chr  "state id" "in today's election for president, did you just vote for:" "are you:" "are you:" ...
##  - attr(*, "version")= int 8
##  - attr(*, "label.table")=List of 14
##   ..$ stanum  : Named num 2
##   .. ..- attr(*, "names")= chr "alaska"
##   ..$ presak04: Named num  0 1 2 3 9
##   .. ..- attr(*, "names")= chr  "did not vote" "kerry" "bush" "nader" ...
##   ..$ sex     : Named num  1 2
##   .. ..- attr(*, "names")= chr  "male" "female"
##   ..$ race    : Named num  1 2 3 4 5
##   .. ..- attr(*, "names")= chr  "white" "black" "hispanic/latino" "asian" ...
##   ..$ age9    : Named num  1 2 3 4 5 6 7 8 9
##   .. ..- attr(*, "names")= chr  "18-24" "25-29" "30-39" "40-44" ...
##   ..$ partyid : Named num  1 2 3 4
##   .. ..- attr(*, "names")= chr  "democrat" "republican" "independent" "something else"
##   ..$ income  : Named num  1 2 3 4 5 6 7 8
##   .. ..- attr(*, "names")= chr  "under $15,000" "$15,000-$29,999" "$30,000-$49,999" "$50,000-$74,999" ...
##   ..$ relign8 : Named num  1 2 3 4 5 6 7 8
##   .. ..- attr(*, "names")= chr  "protestant" "catholic" "mormon/lds" "other christian" ...
##   ..$ age60   : Named num  1 2 3 4
##   .. ..- attr(*, "names")= chr  "18-29" "30-44" "45-59" "60 or over"
##   ..$ age65   : Named num  1 2 3 4 5 6
##   .. ..- attr(*, "names")= chr  "18-24" "25-29" "30-39" "40-49" ...
##   ..$ geocode : Named num  1 2 3
##   .. ..- attr(*, "names")= chr  "juneau/fairbanks/rural" "anchorage" "anchorage-fairbanks corridor"
##   ..$ sizeplac: Named num  1 2 3 4 5
##   .. ..- attr(*, "names")= chr  "city over 500,000" "city: 50,000 to 500,000" "suburbs" "city: 10,000 to 49,999" ...
##   ..$ brnagain: Named num  1 2
##   .. ..- attr(*, "names")= chr  "yes" "no"
##   ..$ attend  : Named num  1 2 3 4 5 9
##   .. ..- attr(*, "names")= chr  "more than once a week" "once a week" "a few times a month" "a few times a year" ...

Data frames are (special) lists!

is.list(vote)
## [1] TRUE
length(vote)
## [1] 17
vote[[3]][1:5]
## [1] female male   female female female
## Levels: male female
lapply(vote, class)
## $state
## [1] "integer"
## 
## $pres04
## [1] "integer"
## 
## $sex
## [1] "factor"
## 
## $race
## [1] "factor"
## 
## $age9
## [1] "factor"
## 
## $partyid
## [1] "factor"
## 
## $income
## [1] "factor"
## 
## $relign8
## [1] "factor"
## 
## $age60
## [1] "factor"
## 
## $age65
## [1] "factor"
## 
## $geocode
## [1] "integer"
## 
## $sizeplac
## [1] "factor"
## 
## $brnagain
## [1] "factor"
## 
## $attend
## [1] "factor"
## 
## $year
## [1] "numeric"
## 
## $region
## [1] "numeric"
## 
## $y
## [1] "numeric"

lapply() is a function used on lists; it works here to apply the class() function to each element of the list, which in this case is each field/column.

But lists are also vectors!

length(vote)
## [1] 17
someFields <- vote[c(3, 5)]
head(someFields)
##      sex  age9
## 1 female 25-29
## 2   male 18-24
## 3 female 30-39
## 4 female 30-39
## 5 female 40-44
## 6 female 30-39
identical(vote[c(3, 5)], vote[, c(3, 5)])
## [1] TRUE

In general the placement of commas in R is crucial, but here, two different operations give the same result because of the underlying structure of data frames.
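
The two styles stop agreeing once you select a single column. A sketch on a small invented data frame (df and its columns are hypothetical):

```r
df <- data.frame(id = 1:3, grp = c("a", "b", "a"), val = c(2.5, 3.1, 4.7))
identical(df[c(1, 3)], df[, c(1, 3)])   # several columns: same data frame
class(df[3])                   # list-style, one column: still a data frame
class(df[, 3])                 # matrix-style, one column: dropped to a vector
class(df[[3]])                 # [[ ]] extracts the column itself
class(df[, 3, drop = FALSE])   # drop = FALSE keeps the data frame
```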

Factors

class(vote$sizeplac)
## [1] "factor"
head(vote$sizeplac)  # What order are the levels in?
## [1] rural rural rural rural rural rural
## 5 Levels: city over 500,000 city: 50,000 to 500,000 ... rural
levels(vote[["sizeplac"]])
## [1] "city over 500,000"       "city: 50,000 to 500,000"
## [3] "suburbs"                 "city: 10,000 to 49,999" 
## [5] "rural"
summary(vote$sizeplac)
##       city over 500,000 city: 50,000 to 500,000                 suburbs 
##                    5882                   15462                   28796 
##  city: 10,000 to 49,999                   rural                    NA's 
##                    8449                   17501                     115

Ordering the Factor

vote <- within(vote, sizeplac_ord <- ordered(sizeplac, levels = levels(sizeplac)[c(5, 
    3, 4, 2, 1)]))
head(vote$sizeplac_ord)
## [1] rural rural rural rural rural rural
## 5 Levels: rural < suburbs < ... < city over 500,000
levels(vote$sizeplac_ord)
## [1] "rural"                   "suburbs"                
## [3] "city: 10,000 to 49,999"  "city: 50,000 to 500,000"
## [5] "city over 500,000"

Try to decipher what I just did with that complicated single line of code.
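
If it helps, here are the same steps taken one at a time on a toy factor. The names size, new_order, size_ord and d are invented for this sketch.

```r
size <- factor(c("small", "large", "medium", "small"))
levels(size)                            # alphabetical: large, medium, small
new_order <- levels(size)[c(3, 2, 1)]   # pick the level labels in the order we want
size_ord <- ordered(size, levels = new_order)
levels(size_ord)
size_ord < "large"                      # comparisons between levels now work
# within() evaluates the assignment inside the data frame and returns a modified copy
d <- data.frame(size = size)
d <- within(d, size_ord <- ordered(size, levels = levels(size)[c(3, 2, 1)]))
```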

Reclassifying Factors

students <- factor(c("basic", "proficient", "advanced", "basic", "advanced", 
    "minimal"))
levels(students)
## [1] "advanced"   "basic"      "minimal"    "proficient"
unclass(students)
## [1] 2 4 1 2 1 3
## attr(,"levels")
## [1] "advanced"   "basic"      "minimal"    "proficient"
students <- factor(c("basic", "proficient", "advanced", "basic", "advanced", 
    "minimal"))
score = c(minimal = 3, basic = 1, advanced = 13, proficient = 7)  # a named vector
score["advanced"]  # look up by name
## advanced 
##       13
students[3]
## [1] advanced
## Levels: advanced basic minimal proficient
score[students[3]]
## minimal 
##       3
score[as.character(students[3])]
## advanced 
##       13

What's going wrong?
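
A smaller version of the same surprise, as a sketch with made-up objects lev and wt: a factor used as an index is treated as its integer codes, not its labels.

```r
lev <- factor(c("low", "high", "low"))
wt <- c(low = 10, high = 20)   # named vector; names are NOT in alphabetical order
as.integer(lev)                # underlying codes: high = 1, low = 2
wt[lev]                        # indexes by code, i.e. by position
wt[as.character(lev)]          # indexes by name, which is what we wanted
```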

Subsetting

There are many ways to select subsets in R. The syntax below is useful for vectors, matrices, data frames, arrays and lists.

vec <- rnorm(20)
mat <- matrix(vec, 4, 5)
rownames(mat) <- letters[1:4]
mat
##      [,1]     [,2]    [,3]    [,4]     [,5]
## a -0.3112 -1.33090 -0.6350 -0.5241 -1.71910
## b -1.4120  2.11462  1.0845 -0.4235  0.61913
## c -1.7273 -0.28582  1.7979  1.5908  0.48541
## d  0.9401  0.01268 -0.2703  0.5060  0.09669

1) by direct indexing

vec[c(3, 5, 12:14)]
## [1] -1.7273 -1.3309 -0.2703 -0.5241 -0.4235
vec[-c(3, 5)]
##  [1] -0.31122 -1.41203  0.94013  2.11462 -0.28582  0.01268 -0.63495
##  [8]  1.08454  1.79794 -0.27027 -0.52412 -0.42352  1.59083  0.50596
## [15] -1.71910  0.61913  0.48541  0.09669
mat[c(2, 4), 5]
##       b       d 
## 0.61913 0.09669
rowInd <- c(1, 3, 4)
colInd <- c(2, 2, 1)
mat[cbind(rowInd, colInd)]
## [1] -1.3309 -0.2858  0.9401

Note the last usage, where we index with a 2-column matrix of (row, column) pairs.

2) by a vector of logicals

cond <- vec > 0
vec[cond]
##  [1] 0.94013 2.11462 0.01268 1.08454 1.79794 1.59083 0.50596 0.61913
##  [9] 0.48541 0.09669
mat[mat[, 1] > 0, ]
## [1]  0.94013  0.01268 -0.27027  0.50596  0.09669
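
Notice that the last result came back as a plain vector because only one row matched. A sketch of how to keep the matrix structure, using a made-up matrix m:

```r
m <- matrix(1:6, nrow = 2, dimnames = list(c("a", "b"), NULL))
keep <- m[, 1] > 1              # only row "b" qualifies
dim(m[keep, ])                  # NULL: the single row was dropped to a vector
dim(m[keep, , drop = FALSE])    # 1 3: still a matrix
```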

3) by a vector of names

mat[c("a", "d", "a"), ]
##      [,1]     [,2]    [,3]    [,4]     [,5]
## a -0.3112 -1.33090 -0.6350 -0.5241 -1.71910
## d  0.9401  0.01268 -0.2703  0.5060  0.09669
## a -0.3112 -1.33090 -0.6350 -0.5241 -1.71910

4) using subset()

subset(mtcars, mpg > 20)
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Assignment into subsets

We can assign into subsets by using similar syntax, as we saw with vectors.

vec[c(3, 5, 12:14)] <- 1:5
mat[2, 3:5] <- rnorm(3)
mat[mat[, 1] > 0, ] <- -Inf
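
The same idea on small made-up objects (v and m here are hypothetical and separate from vec and mat above), where the results are easy to check by eye:

```r
v <- 1:10
v[v > 7] <- 0L                 # logical index on the left-hand side
v[c(1, 2)] <- c(100L, 200L)    # positional index
m <- matrix(0, 2, 3)
m[1, ] <- 1:3                  # fill a whole row
m[m == 0] <- NA                # replace by condition
```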

Strings

R has lots of functionality for character strings. Usually these are stored as vectors of strings, each with arbitrary length.

chars <- c('hi', 'hallo', "mother's", 'father\'s', "He said, \"hi\"" )
length(chars)
## [1] 5
nchar(chars)
## [1]  2  5  8  8 13
paste("bill", "clinton", sep = " ")  # paste together a set of strings
## [1] "bill clinton"
paste(chars, collapse = ' ')  # paste together things from a vector
## [1] "hi hallo mother's father's He said, \"hi\""

strsplit("This is the R bootcamp", split = " ")
## [[1]]
## [1] "This"     "is"       "the"      "R"        "bootcamp"
substring(chars, 2, 3)
## [1] "i"  "al" "ot" "at" "e "
chars2 <- chars
substring(chars2, 2, 3) <- "ZZ"
chars2
## [1] "hZ"              "hZZlo"           "mZZher's"        "fZZher's"       
## [5] "HZZsaid, \"hi\""

We can search for patterns in character vectors and replace patterns (both operations are vectorized!).

grep("ther", chars)
## [1] 3 4
gsub("hi", "Hi", chars)
## [1] "Hi"              "hallo"           "mother's"        "father's"       
## [5] "He said, \"Hi\""

Regular expressions (regex or regexp)

Some of you may be familiar with regular expressions, which provide sophisticated pattern matching and replacement for strings. Python and Perl are both used extensively for such text manipulation.

R has a full set of regular expression capabilities available through the grep(), gregexpr(), and gsub() functions (among others - many R functions will work with regular expressions).

You can basically do any regular expression/string manipulations in R, though the syntax may be a bit clunky at times.
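
A few typical uses, as a sketch on invented file names (x and the result names are hypothetical):

```r
x <- c("alpha_01.csv", "beta_2.txt", "gamma.csv", "delta_123.csv")
is_csv <- grepl("\\.csv$", x)                  # which names end in .csv?
numbered <- grep("_[0-9]+\\.", x, value = TRUE)  # names with a numeric suffix
# Capture groups: keep only the number; strings that do not match come back unchanged
nums <- sub("^([a-z]+)_([0-9]+)\\..*$", "\\2", x)
# gregexpr() finds every match and regmatches() extracts them
runs <- regmatches("a1b22c333", gregexpr("[0-9]+", "a1b22c333"))[[1]]
```

Note the doubled backslashes: the pattern is an R string first and a regular expression second.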

The working directory

To read and write files from R, you need a firm grasp of where in the computer's filesystem you are reading from and writing to.

getwd()  # what directory will R look in?
## [1] "/accounts/gen/vis/paciorek/staff/workshops/r-bootcamp-2013/modules"
setwd("~/Desktop/r-bootcamp-2013")  # change the working directory
setwd("/Users/paciorek/Desktop/tmp")  # absolute path
## Error: cannot change working directory
getwd()
## [1] "/accounts/gen/vis/paciorek/Desktop/r-bootcamp-2013"
setwd("../r-bootcamp-2013")  # relative path

Many errors and much confusion result from you and R not being on the same page in terms of where in the directory structure you are.
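
One defensive habit, sketched here with a hypothetical file name: remember where you started, build paths with file.path(), and restore the directory when you are done.

```r
old <- getwd()                     # remember where we started
setwd(tempdir())                   # move somewhere that is sure to exist
file.exists("no_such_file.csv")    # relative paths are resolved against getwd()
# file.path() joins the pieces for you instead of pasting separators by hand
target <- file.path("..", "data", "example.csv")   # hypothetical file name
setwd(old)                         # put things back
```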

Reading data into R

The workhorse for reading into a data frame is read.table(), which allows any separator (comma, tab, etc.). read.csv() is a special case of read.table() for CSV files.

rta <- read.table("../data/RTAData.csv", sep = ",", header = TRUE)
rta[1:5, 1:5]
##               time X40010 X40015 X40020 X40025
## 1 2010-03-01 14:58    821    209    828    258
## 2 2010-03-01 15:01    804    209    804    248
## 3 2010-03-01 15:04    892    212    801    237
## 4 2010-03-01 15:07    857    214    821    243
## 5 2010-03-01 15:10    849    222    834    252
dim(rta)
## [1] 120822     62
# great, we're all set, right?  Not so fast...
unlist(lapply(rta, class))[1:5]
##     time   X40010   X40015   X40020   X40025 
## "factor" "factor" "factor" "factor" "factor"
# ?read.table
rta2 <- read.table("../data/RTAData.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)
rta2[3, 3]
## [1] "212"
unlist(lapply(rta2, class))[1:5]
##        time      X40010      X40015      X40020      X40025 
## "character" "character" "character" "character" "character"
# let's delve more deeply
levels(rta[, 2])[c(1:5, 3041:3044)]
## [1] ""     "1000" "1001" "1002" "1003" "997"  "998"  "999"  "x"
rta3 <- read.table("../data/RTAData.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE,
    na.strings = c("NA", "x"))
unlist(lapply(rta3, class))[1:5]
##        time      X40010      X40015      X40020      X40025 
## "character"   "integer"   "integer"   "integer"   "integer"

# checking...
missing <- which(rta[, 2] == "")
missing[1:5]
## [1] 1167 1168 1169 1170 1171
rta3[head(missing), ]
##                  time X40010 X40015 X40020 X40025 X40030 X40035 X40040
## 1167 2010-03-04 01:16     NA     NA     NA     NA     NA     NA     NA
## 1168 2010-03-04 01:19     NA     NA     NA     NA     NA     NA     NA
## 1169 2010-03-04 01:22     NA     NA     NA     NA     NA     NA     NA
## 1170 2010-03-04 01:25     NA     NA     NA     NA     NA     NA     NA
## 1171 2010-03-04 01:28     NA     NA     NA     NA     NA     NA     NA
## 1172 2010-03-04 01:31     NA     NA     NA     NA     NA     NA     NA
##      X40045 X40050 X40055 X40060 X40065 X40070 X40075 X40080 X40085 X40090
## 1167     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1168     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1169     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1170     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1171     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1172     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
##      X40092 X40095 X40100 X40105 X40110 X40115 X40120 X40125 X40130 X40135
## 1167     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1168     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1169     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1170     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1171     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1172     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
##      X40140 X40145 X40150 X41010 X41015 X41020 X41025 X41030 X41035 X41040
## 1167     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1168     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1169     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1170     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1171     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1172     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
##      X41045 X41050 X41055 X41060 X41065 X41070 X41075 X41080 X41085 X41090
## 1167     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1168     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1169     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1170     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1171     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1172     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
##      X41095 X41100 X41105 X41110 X41115 X41120 X41125 X41130 X41135 X41140
## 1167     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1168     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1169     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1170     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1171     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 1172     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
##      X41145 X41150 X41155 X41160
## 1167     NA     NA     NA     NA
## 1168     NA     NA     NA     NA
## 1169     NA     NA     NA     NA
## 1170     NA     NA     NA     NA
## 1171     NA     NA     NA     NA
## 1172     NA     NA     NA     NA

It's good to first look at your data in plain text format outside of R and then to check it after you've read it into R.
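
You can rehearse this kind of check without the real file. The sketch below uses a tiny inline stand-in for a CSV (the column names and values are invented) to show how one stray code changes a column's type and how na.strings fixes it.

```r
# Inline stand-in for a CSV file; the stray "x" and the empty field are deliberate
csv_text <- "time,s1,s2
2010-03-01 14:58,821,209
2010-03-01 15:01,x,209
2010-03-01 15:04,892,"
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
sapply(raw, class)        # s1 is character because of the stray "x"
clean <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                  na.strings = c("NA", "x", ""))
sapply(clean, class)      # s1 and s2 are now integer
colSums(is.na(clean))     # count the missing values per column
```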

Other ways to read data into R

The read.table() family of functions only scratches the surface…

1) You can also read in a file as a vector of character strings, one string per line of the file, with readLines(), and then post-process it.
2) You can read fixed width format (constant number of characters per field) with read.fwf().
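
Both approaches on a small temporary file, as a sketch (the records, widths and column names are made up):

```r
tmp <- tempfile(fileext = ".txt")
writeLines(c("AB123 4.5", "CD456 7.0"), tmp)   # made-up fixed-width records
lines <- readLines(tmp)     # one character string per line
nchar(lines)
# Fields of 2, 3 and 4 characters (the last field includes the leading space)
fw <- read.fwf(tmp, widths = c(2, 3, 4), col.names = c("code", "id", "value"))
str(fw)
unlink(tmp)                 # clean up the temporary file
```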

Reading 'foreign' format data

We've already seen an example of reading data produced by another statistical package with read.dta(). There are a number of other formats that we can handle for either reading or writing. Let's look at library(help = foreign).

R can also read in (and write out) Excel files, netCDF files, HDF5 files, etc., in many cases through add-on packages from CRAN.

A pause for a (gentle) diatribe:

Please try to avoid using Excel files as a data storage format. The format is proprietary and complicated (a file can have multiple sheets), allows a limited number of rows and columns, and is not easily readable or viewable (unlike simple text files).

Writing data out from R

Here you have a number of options.

1) You can write out R objects to an R Data file, as we've seen, using save() and save.image().
2) You can use write.csv() and write.table() to write data frames/matrices to flat text files with delimiters such as comma and tab.
3) You can use write() to write out matrices in a simple flat text format.
4) You can use cat() to write to a file, while controlling the formatting to a fine degree.
5) You can write out in the various file formats mentioned on the previous slide.
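
A sketch of a few of these options, writing only to temporary files (the data frame d and all file names are invented):

```r
d <- data.frame(id = 1:3, score = c(2.5, NA, 4))
out <- tempfile(fileext = ".csv")
write.csv(d, out, row.names = FALSE)   # comma-delimited text; NA is written as "NA"
back <- read.csv(out)
all.equal(d, back)                     # the round trip preserves the data
# cat() with sprintf() controls the exact layout of each line
cat(sprintf("%03d|%6.2f\n", 1:2, c(3.14159, 2.71828)), sep = "",
    file = tempfile())
# save() keeps the R object itself, attributes and all
rda <- tempfile(fileext = ".RData")
save(d, file = rda)
```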

Writing out plots and tables

pdf("myplot.pdf", width = 7, height = 7)
x <- rnorm(10)
y <- rnorm(10)
plot(x, y)
dev.off()
## pdf 
##   2

xtable() formats tables for HTML and LaTeX (the default).

require(xtable)
print(xtable(table(vote$race, vote$pres04)), type = "html")
## <!-- html table generated in R 3.0.1 by xtable 1.7-1 package -->
## <!-- Fri Aug 23 16:59:02 2013 -->
## <TABLE border=1>
## <TR> <TH>  </TH> <TH> 0 </TH> <TH> 1 </TH> <TH> 2 </TH> <TH> 3 </TH> <TH> 4 </TH> <TH> 9 </TH>  </TR>
##   <TR> <TD align="right"> white </TD> <TD align="right"> 111 </TD> <TD align="right"> 26184 </TD> <TD align="right"> 33045 </TD> <TD align="right"> 417 </TD> <TD align="right">  14 </TD> <TD align="right"> 409 </TD> </TR>
##   <TR> <TD align="right"> black </TD> <TD align="right">  18 </TD> <TD align="right"> 6183 </TD> <TD align="right"> 824 </TD> <TD align="right">  56 </TD> <TD align="right">   0 </TD> <TD align="right">  21 </TD> </TR>
##   <TR> <TD align="right"> hispanic/latino </TD> <TD align="right">   6 </TD> <TD align="right"> 2665 </TD> <TD align="right"> 1639 </TD> <TD align="right">  34 </TD> <TD align="right">   3 </TD> <TD align="right">  49 </TD> </TR>
##   <TR> <TD align="right"> asian </TD> <TD align="right">   0 </TD> <TD align="right"> 626 </TD> <TD align="right"> 384 </TD> <TD align="right">   7 </TD> <TD align="right">   1 </TD> <TD align="right">   2 </TD> </TR>
##   <TR> <TD align="right"> other </TD> <TD align="right">  16 </TD> <TD align="right"> 1036 </TD> <TD align="right"> 653 </TD> <TD align="right">  22 </TD> <TD align="right">   0 </TD> <TD align="right">  32 </TD> </TR>
##    </TABLE>

Recall our discussion of the summary() function used on either a vector or a regression object. What do you think is going on with the call to print() above?
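
As a hint, print() is a generic function that picks a method based on the class of its argument (the xtable package supplies print.xtable()). The sketch below shows the same mechanism with a hypothetical class called "temperature"; nothing here comes from xtable itself.

```r
# Hypothetical S3 class with its own print method
temp <- structure(list(value = 21.5, unit = "C"), class = "temperature")
print.temperature <- function(x, ...) {
  cat(x$value, "degrees", x$unit, "\n")
  invisible(x)
}
print(temp)            # dispatches to print.temperature()
class(unclass(temp))   # without the class attribute it is just a list
```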

Breakout

  1. Using the voting/presidential preference dataset, create a new column based on age9 that gives, as a numeric value, the midpoint of the age range assigned to each person. Try to do this with a combination of subsetting and string operations (i.e., can you convert the numbers stored as character strings to actual numbers?). To simplify things, feel free to get rid of the rows for ages “75 and over”.

  2. Go back to slide 6 on logical vectors and figure out what is going on in the last few lines of code.

  3. Go back to slide 22 and figure out what is going on with that complicated last line of code.