August 2013, UC Berkeley
Chris Paciorek
Let's review matrices
mat <- matrix(rnorm(12), nrow = 3, ncol = 4)
mat
## [,1] [,2] [,3] [,4]
## [1,] 1.932 -0.4547 -0.3258 -0.8163
## [2,] 1.152 -0.1366 -0.4712 0.8153
## [3,] 1.409 -0.3346 0.9246 -2.6196
# vectorized calcs work with matrices too
mat * 4
## [,1] [,2] [,3] [,4]
## [1,] 7.730 -1.8188 -1.303 -3.265
## [2,] 4.608 -0.5465 -1.885 3.261
## [3,] 5.638 -1.3384 3.698 -10.478
mat <- cbind(mat, 1:3)
mat
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.932 -0.4547 -0.3258 -0.8163 1
## [2,] 1.152 -0.1366 -0.4712 0.8153 2
## [3,] 1.409 -0.3346 0.9246 -2.6196 3
Arrays are like matrices but can have more or fewer than two dimensions.
arr <- array(rnorm(12), c(2, 3, 4))
arr
## , , 1
##
## [,1] [,2] [,3]
## [1,] 0.119 -2.535 0.9842
## [2,] 0.232 1.446 -0.6479
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] -0.2608 -0.2994 1.0665
## [2,] 0.9045 -0.6592 -0.3387
##
## , , 3
##
## [,1] [,2] [,3]
## [1,] 0.119 -2.535 0.9842
## [2,] 0.232 1.446 -0.6479
##
## , , 4
##
## [,1] [,2] [,3]
## [1,] -0.2608 -0.2994 1.0665
## [2,] 0.9045 -0.6592 -0.3387
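As a small sketch of how array dimensions and indexing behave (the names arr3 and short, and the use of 1:24 in place of random numbers, are choices made here for illustration), note that array() recycles its data when given fewer values than the dimensions require. That is why slices 3 and 4 in the output above repeat slices 1 and 2.

```r
# Fill a 2 x 3 x 4 array with 1:24 so each element equals its position in storage
arr3 <- array(1:24, dim = c(2, 3, 4))
dim(arr3)               # 2 3 4
arr3[2, 3, 4]           # the last element, 24
slice1 <- arr3[, , 1]   # fixing the third index leaves a 2 x 3 matrix
is.matrix(slice1)       # TRUE

# Supplying only 12 values recycles them to fill all 24 positions
short <- array(1:12, dim = c(2, 3, 4))
identical(short[, , 1], short[, , 3])   # TRUE
```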
Objects have attributes.
attributes(mat)
## $dim
## [1] 3 5
rownames(mat) <- c("first", "middle", "last")
mat
## [,1] [,2] [,3] [,4] [,5]
## first 1.932 -0.4547 -0.3258 -0.8163 1
## middle 1.152 -0.1366 -0.4712 0.8153 2
## last 1.409 -0.3346 0.9246 -2.6196 3
attributes(mat)
## $dim
## [1] 3 5
##
## $dimnames
## $dimnames[[1]]
## [1] "first" "middle" "last"
##
## $dimnames[[2]]
## NULL
mat[4]
## [1] -0.4547
attributes(mat) <- NULL
mat
## [1] 1.9324 1.1520 1.4094 -0.4547 -0.1366 -0.3346 -0.3258 -0.4712
## [9] 0.9246 -0.8163 0.8153 -2.6196 1.0000 2.0000 3.0000
is.matrix(mat)
## [1] FALSE
What can you infer about what a matrix is in R?
What kind of object are the attributes themselves? How do I check?
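Matrices are stored in column-major order: the elements of the first column come first, then those of the second column, and so on. This is like Fortran but not like C.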
mat <- matrix(1:12, 3, 4)
mat
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
c(mat)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
You can go smoothly back and forth between a matrix (or an array) and a vector:
identical(mat, matrix(c(mat), 3, 4))
## [1] TRUE
identical(mat, matrix(c(mat), 3, 4, byrow = TRUE))
## [1] FALSE
This is a common cause of bugs!
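A minimal sketch of where the bug tends to come from, assuming data that were written down row by row (the names m, rowData, byRow and wrong are made up for this example). It also shows how a (row, column) pair maps to a position in the underlying vector.

```r
m <- matrix(1:12, nrow = 3, ncol = 4)   # filled column by column
i <- 2; j <- 3
linearIndex <- i + (j - 1) * nrow(m)    # position of element (i, j) in the underlying vector
m[i, j] == m[linearIndex]               # TRUE

# Values typed in row by row (e.g., copied from a printed table) need byrow = TRUE
rowData <- c(1,  2,  3,  4,
             5,  6,  7,  8,
             9, 10, 11, 12)
byRow <- matrix(rowData, nrow = 3, ncol = 4, byrow = TRUE)
byRow[2, 3]                             # 7, as in the printed layout
wrong <- matrix(rowData, nrow = 3, ncol = 4)
wrong[2, 3]                             # 8, silently different
```

No error or warning is raised in the second case, since both calls produce a valid 3 x 4 matrix.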
Since it was designed by statisticians, R handles missing values very well relative to other languages.
NA is a missing value.
vec <- rnorm(12)
vec[c(3, 5)] <- NA
vec
## [1] -0.7016 -1.3075 NA -1.4403 NA -0.8394 1.3689 -0.4037
## [9] 2.6236 -1.2082 0.8394 -0.4275
length(vec)
## [1] 12
sum(vec)
## [1] NA
sum(vec, na.rm = TRUE)
## [1] -1.496
hist(vec)
[Figure: histogram of vec]
is.na(vec)
## [1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE
Be careful because many R functions won't warn you that they are ignoring the missing values.
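Two quick illustrations of silent NA handling, sketched with made-up vectors x, y and z: table() leaves NA out of its counts unless asked otherwise, and lm() drops incomplete rows under the default na.action setting.

```r
x <- c(1, 2, NA, 2, NA)
counts <- table(x)                      # NAs are dropped without any message
countsNA <- table(x, useNA = "ifany")   # ask explicitly for NAs to be counted
sum(counts)                             # 3
sum(countsNA)                           # 5

y <- c(2.1, 3.9, 6.2, NA, 9.8)
z <- 1:5
fit <- lm(y ~ z)   # the row with the missing y is dropped (default na.action)
nobs(fit)          # 4 observations actually used
length(y)          # 5 observations supplied
```

Comparing nobs() with the length of the input is one simple way to notice that rows went missing.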
big <- Inf
big
## [1] Inf
big + 7
## [1] Inf
NaN stands for Not a Number.
sqrt(-5)
## Warning: NaNs produced
## [1] NaN
big - big
## [1] NaN
1/0
## [1] Inf
NULL
vec <- c(vec, NULL)
vec
## [1] -0.7016 -1.3075 NA -1.4403 NA -0.8394 1.3689 -0.4037
## [9] 2.6236 -1.2082 0.8394 -0.4275
length(vec)
## [1] 12
a <- NULL
a + 7
## numeric(0)
a[3, 4]
## NULL
is.null(a)
## [1] TRUE
myList <- list(a = 7, b = 5)
myList$a <- NULL # works for data frames too
myList
## $b
## [1] 5
NA can hold a place but NULL cannot. NULL is useful for having a function argument default to 'nothing'. See help(crossprod), which can compute either t(X) %*% X or t(X) %*% Y.
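The NULL-default idiom can be sketched as follows. help(crossprod) documents the argument y = NULL; when y is not supplied the result is t(x) %*% x, otherwise t(x) %*% y. The function myCrossprod below is a hypothetical stand-in written for illustration, not how crossprod() itself is implemented.

```r
# Hypothetical function (not part of base R) mimicking the interface of crossprod()
myCrossprod <- function(x, y = NULL) {
  if (is.null(y)) {
    y <- x            # no second matrix given, so fall back to x
  }
  t(x) %*% y
}

X <- matrix(1:6, nrow = 3)
Y <- matrix(c(1, 0, 2, 1, 1, 0), nrow = 3)
myCrossprod(X)       # same values as crossprod(X)
myCrossprod(X, Y)    # same values as crossprod(X, Y)
```

Testing with is.null() inside the function keeps the call simple for the common one-argument case.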
answers <- c(TRUE, TRUE, FALSE, FALSE)
update <- c(TRUE, FALSE, TRUE, FALSE)
answers & update
## [1] TRUE FALSE FALSE FALSE
answers | update
## [1] TRUE TRUE TRUE FALSE
# note the vectorized boolean arithmetic
# what am I doing here?
sum(answers)
## [1] 2
mean(answers)
## [1] 0.5
answers + update
## [1] 2 1 1 0
What do you think R does in order to perform arithmetic on logical vectors?
Tricks with logicals...
identical(answers & update, as.logical(answers * update))
## [1] TRUE
identical(answers | update, as.logical(answers + update))
## [1] TRUE
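One common use of arithmetic on logicals is counting, or taking the proportion of, elements that satisfy a condition. A short sketch with a made-up vector x:

```r
x <- c(3, -1, 4, -1, 5, -9)
positive <- x > 0                # logical vector, one entry per element of x
nPositive <- sum(positive)       # how many elements are positive: 3
propPositive <- mean(positive)   # what fraction are positive: 0.5
anyBig <- any(x > 4)             # TRUE, since 5 > 4
allBig <- all(x > 4)             # FALSE
```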
A review from Module 1...
require(foreign)
## Loading required package: foreign
vote = read.dta("../data/2004_labeled_processed_race.dta")
class(vote)
## [1] "data.frame"
head(vote)
## state pres04 sex race age9 partyid income relign8 age60 age65
## 1 2 1 female white 25-29 <NA> <NA> <NA> 18-29 25-29
## 2 2 2 male white 18-24 <NA> <NA> <NA> 18-29 18-24
## 3 2 1 female black 30-39 <NA> <NA> <NA> 30-44 30-39
## 4 2 1 female black 30-39 <NA> <NA> <NA> 30-44 30-39
## 5 2 1 female white 40-44 <NA> <NA> <NA> 30-44 40-49
## 6 2 1 female white 30-39 <NA> <NA> <NA> 30-44 30-39
## geocode sizeplac brnagain attend year region y
## 1 3 rural <NA> <NA> 2004 4 0
## 2 3 rural <NA> <NA> 2004 4 1
## 3 3 rural <NA> <NA> 2004 4 0
## 4 3 rural <NA> <NA> 2004 4 0
## 5 3 rural <NA> <NA> 2004 4 0
## 6 3 rural <NA> <NA> 2004 4 0
str(vote)
## 'data.frame': 76205 obs. of 17 variables:
## $ state : int 2 2 2 2 2 2 2 2 2 2 ...
## $ pres04 : int 1 2 1 1 1 1 1 2 2 2 ...
## $ sex : Factor w/ 2 levels "male","female": 2 1 2 2 2 2 1 2 2 2 ...
## $ race : Factor w/ 5 levels "white","black",..: 1 1 2 2 1 1 1 1 1 1 ...
## $ age9 : Factor w/ 9 levels "18-24","25-29",..: 2 1 3 3 4 3 4 1 2 1 ...
## $ partyid : Factor w/ 4 levels "democrat","republican",..: NA NA NA NA NA NA NA NA NA NA ...
## $ income : Factor w/ 8 levels "under $15,000",..: NA NA NA NA NA NA NA NA NA NA ...
## $ relign8 : Factor w/ 8 levels "protestant","catholic",..: NA NA NA NA NA NA NA NA NA NA ...
## $ age60 : Factor w/ 4 levels "18-29","30-44",..: 1 1 2 2 2 2 2 1 1 1 ...
## $ age65 : Factor w/ 6 levels "18-24","25-29",..: 2 1 3 3 4 3 4 1 2 1 ...
## $ geocode : int 3 3 3 3 3 3 3 3 3 3 ...
## $ sizeplac: Factor w/ 5 levels "city over 500,000",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ brnagain: Factor w/ 2 levels "yes","no": NA NA NA NA NA NA NA NA NA NA ...
## $ attend : Factor w/ 6 levels "more than once a week",..: NA NA NA NA NA NA NA NA NA NA ...
## $ year : num 2004 2004 2004 2004 2004 ...
## $ region : num 4 4 4 4 4 4 4 4 4 4 ...
## $ y : num 0 1 0 0 0 0 0 1 1 1 ...
## - attr(*, "datalabel")= chr ""
## - attr(*, "time.stamp")= chr " 6 Jun 2007 14:53"
## - attr(*, "formats")= chr "%8.0g" "%8.0g" "%8.0g" "%8.0g" ...
## - attr(*, "types")= int 251 251 251 251 251 251 251 251 251 251 ...
## - attr(*, "val.labels")= chr "stanum" "presak04" "sex" "race" ...
## - attr(*, "var.labels")= chr "state id" "in today's election for president, did you just vote for:" "are you:" "are you:" ...
## - attr(*, "version")= int 8
## - attr(*, "label.table")=List of 14
## ..$ stanum : Named num 2
## .. ..- attr(*, "names")= chr "alaska"
## ..$ presak04: Named num 0 1 2 3 9
## .. ..- attr(*, "names")= chr "did not vote" "kerry" "bush" "nader" ...
## ..$ sex : Named num 1 2
## .. ..- attr(*, "names")= chr "male" "female"
## ..$ race : Named num 1 2 3 4 5
## .. ..- attr(*, "names")= chr "white" "black" "hispanic/latino" "asian" ...
## ..$ age9 : Named num 1 2 3 4 5 6 7 8 9
## .. ..- attr(*, "names")= chr "18-24" "25-29" "30-39" "40-44" ...
## ..$ partyid : Named num 1 2 3 4
## .. ..- attr(*, "names")= chr "democrat" "republican" "independent" "something else"
## ..$ income : Named num 1 2 3 4 5 6 7 8
## .. ..- attr(*, "names")= chr "under $15,000" "$15,000-$29,999" "$30,000-$49,999" "$50,000-$74,999" ...
## ..$ relign8 : Named num 1 2 3 4 5 6 7 8
## .. ..- attr(*, "names")= chr "protestant" "catholic" "mormon/lds" "other christian" ...
## ..$ age60 : Named num 1 2 3 4
## .. ..- attr(*, "names")= chr "18-29" "30-44" "45-59" "60 or over"
## ..$ age65 : Named num 1 2 3 4 5 6
## .. ..- attr(*, "names")= chr "18-24" "25-29" "30-39" "40-49" ...
## ..$ geocode : Named num 1 2 3
## .. ..- attr(*, "names")= chr "juneau/fairbanks/rural" "anchorage" "anchorage-fairbanks corridor"
## ..$ sizeplac: Named num 1 2 3 4 5
## .. ..- attr(*, "names")= chr "city over 500,000" "city: 50,000 to 500,000" "suburbs" "city: 10,000 to 49,999" ...
## ..$ brnagain: Named num 1 2
## .. ..- attr(*, "names")= chr "yes" "no"
## ..$ attend : Named num 1 2 3 4 5 9
## .. ..- attr(*, "names")= chr "more than once a week" "once a week" "a few times a month" "a few times a year" ...
is.list(vote)
## [1] TRUE
length(vote)
## [1] 17
vote[[3]][1:5]
## [1] female male female female female
## Levels: male female
lapply(vote, class)
## $state
## [1] "integer"
##
## $pres04
## [1] "integer"
##
## $sex
## [1] "factor"
##
## $race
## [1] "factor"
##
## $age9
## [1] "factor"
##
## $partyid
## [1] "factor"
##
## $income
## [1] "factor"
##
## $relign8
## [1] "factor"
##
## $age60
## [1] "factor"
##
## $age65
## [1] "factor"
##
## $geocode
## [1] "integer"
##
## $sizeplac
## [1] "factor"
##
## $brnagain
## [1] "factor"
##
## $attend
## [1] "factor"
##
## $year
## [1] "numeric"
##
## $region
## [1] "numeric"
##
## $y
## [1] "numeric"
lapply() is a function used on lists; it works here to apply the class() function to each element of the list, which in this case is each field/column.
length(vote)
## [1] 17
someFields <- vote[c(3, 5)]
head(someFields)
## sex age9
## 1 female 25-29
## 2 male 18-24
## 3 female 30-39
## 4 female 30-39
## 5 female 40-44
## 6 female 30-39
identical(vote[c(3, 5)], vote[, c(3, 5)])
## [1] TRUE
In general the placement of commas in R is crucial, but here, two different operations give the same result because of the underlying structure of data frames.
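To see where the comma does and does not matter, here is a sketch on a small made-up data frame df (the column names are arbitrary). Selecting several columns gives the same result either way, but selecting one column, or putting the index before the comma, does not.

```r
df <- data.frame(id = 1:3, grp = c("a", "b", "a"), score = c(2.5, 3.1, 4.8))

listStyle <- df[c(1, 3)]        # list-style: pick elements (columns) 1 and 3
matrixStyle <- df[, c(1, 3)]    # matrix-style: all rows, columns 1 and 3
identical(listStyle, matrixStyle)   # TRUE

firstRow <- df[1, ]    # index before the comma: the first row, still a data frame
firstCol <- df[1]      # no comma: a one-column data frame
firstVec <- df[, 1]    # comma with a single column: a plain vector, like df[[1]]
```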
class(vote$sizeplac)
## [1] "factor"
head(vote$sizeplac) # What order are the factors in?
## [1] rural rural rural rural rural rural
## 5 Levels: city over 500,000 city: 50,000 to 500,000 ... rural
levels(vote[["sizeplac"]])
## [1] "city over 500,000" "city: 50,000 to 500,000"
## [3] "suburbs" "city: 10,000 to 49,999"
## [5] "rural"
summary(vote$sizeplac)
## city over 500,000 city: 50,000 to 500,000 suburbs
## 5882 15462 28796
## city: 10,000 to 49,999 rural NA's
## 8449 17501 115
vote <- within(vote, sizeplac_ord <- ordered(sizeplac, levels = levels(sizeplac)[c(5,
3, 4, 2, 1)]))
head(vote$sizeplac_ord)
## [1] rural rural rural rural rural rural
## 5 Levels: rural < suburbs < ... < city over 500,000
levels(vote$sizeplac_ord)
## [1] "rural" "suburbs"
## [3] "city: 10,000 to 49,999" "city: 50,000 to 500,000"
## [5] "city over 500,000"
Try to decipher what I just did with that complicated single line of code.
students <- factor(c("basic", "proficient", "advanced", "basic", "advanced",
"minimal"))
levels(students)
## [1] "advanced" "basic" "minimal" "proficient"
unclass(students)
## [1] 2 4 1 2 1 3
## attr(,"levels")
## [1] "advanced" "basic" "minimal" "proficient"
students <- factor(c("basic", "proficient", "advanced", "basic", "advanced",
"minimal"))
score = c(minimal = 3, basic = 1, advanced = 13, proficient = 7) # a named vector
score["advanced"] # look up by name
## advanced
## 13
students[3]
## [1] advanced
## Levels: advanced basic minimal proficient
score[students[3]]
## minimal
## 3
score[as.character(students[3])]
## advanced
## 13
What's going wrong?
There are many ways to select subsets in R. The syntax below is useful for vectors, matrices, data frames, arrays and lists.
vec <- rnorm(20)
mat <- matrix(vec, 4, 5)
rownames(mat) <- letters[1:4]
mat
## [,1] [,2] [,3] [,4] [,5]
## a -1.1228 1.5660 -0.1224 -1.77071 0.2603
## b 1.0336 0.8607 -1.9927 0.53826 -0.3715
## c 0.1648 -0.4020 -0.7387 1.20478 1.5358
## d -0.3601 -1.3116 -0.2371 0.06624 -2.0234
vec[c(3, 5, 12:14)]
## [1] 0.1648 1.5660 -0.2371 -1.7707 0.5383
vec[-c(3, 5)]
## [1] -1.12280 1.03364 -0.36007 0.86069 -0.40202 -1.31160 -0.12240
## [8] -1.99270 -0.73866 -0.23711 -1.77071 0.53826 1.20478 0.06624
## [15] 0.26027 -0.37151 1.53584 -2.02344
mat[c(2, 4), 5]
## b d
## -0.3715 -2.0234
rowInd <- c(1, 3, 4)
colInd <- c(2, 2, 1)
mat[cbind(rowInd, colInd)]
## [1] 1.5660 -0.4020 -0.3601
Note the last usage, where we give it a 2-column matrix of indices: each row of that matrix is one (row, column) pair.
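A short sketch contrasting index-matrix subsetting with ordinary row/column subsetting, using a made-up matrix m whose entries make the positions easy to check:

```r
m <- matrix(1:12, nrow = 3, ncol = 4)
idx <- cbind(c(1, 2, 3), c(4, 2, 1))   # the pairs (1,4), (2,2), (3,1)

picked <- m[idx]                       # one value per pair: 10 5 3
block <- m[c(1, 2, 3), c(4, 2, 1)]     # by contrast, a full 3 x 3 submatrix

m[idx] <- 0L                           # assignment works with the same syntax
```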
cond <- vec > 0
vec[cond]
## [1] 1.03364 0.16476 1.56601 0.86069 0.53826 1.20478 0.06624 0.26027 1.53584
mat[mat[, 1] > 0, ]
## [,1] [,2] [,3] [,4] [,5]
## b 1.0336 0.8607 -1.9927 0.5383 -0.3715
## c 0.1648 -0.4020 -0.7387 1.2048 1.5358
mat[c("a", "d", "a"), ]
## [,1] [,2] [,3] [,4] [,5]
## a -1.1228 1.566 -0.1224 -1.77071 0.2603
## d -0.3601 -1.312 -0.2371 0.06624 -2.0234
## a -1.1228 1.566 -0.1224 -1.77071 0.2603
subset(mtcars, mpg > 20)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
We can assign into subsets by using similar syntax, as we saw with vectors.
vec[c(3, 5, 12:14)] <- 1:5
mat[2, 3:5] <- rnorm(3)
mat[mat[, 1] > 0, ] <- -Inf
R has lots of functionality for character strings. Usually these are stored as vectors of strings, each with arbitrary length.
chars <- c('hi', 'hallo', "mother's", 'father\'s', "He said, \"hi\"" )
length(chars)
## [1] 5
nchar(chars)
## [1] 2 5 8 8 13
paste("bill", "clinton", sep = " ") # paste together a set of strings
## [1] "bill clinton"
paste(chars, collapse = ' ') # paste together things from a vector
## [1] "hi hallo mother's father's He said, \"hi\""
strsplit("This is the R bootcamp", split = " ")
## [[1]]
## [1] "This" "is" "the" "R" "bootcamp"
substring(chars, 2, 3)
## [1] "i" "al" "ot" "at" "e "
chars2 <- chars
substring(chars2, 2, 3) <- "ZZ"
chars2
## [1] "hZ" "hZZlo" "mZZher's" "fZZher's"
## [5] "HZZsaid, \"hi\""
We can search for patterns in character vectors and replace patterns (both operations are vectorized!).
grep("ther", chars)
## [1] 3 4
gsub("hi", "Hi", chars)
## [1] "Hi" "hallo" "mother's" "father's"
## [5] "He said, \"Hi\""
Some of you may be familiar with using regular expressions, which is functionality for doing sophisticated pattern matching and replacement with strings. Python and Perl are both used extensively for such text manipulation.
R has a full set of regular expression capabilities available through the grep(), gregexpr(), and gsub() functions (among others - many R functions will work with regular expressions).
You can basically do any regular expression/string manipulations in R, though the syntax may be a bit clunky at times.
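A brief sketch of regular expressions in base R, using a made-up vector of strings called stamps. The patterns here are only illustrative.

```r
stamps <- c("2010-03-01 14:58", "2010-03-04 01:16", "not a date")

# Which elements start with something shaped like YYYY-MM-DD?
looksLikeDate <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}", stamps)

# Rearrange the date using captured groups: \\1 = year, \\2 = month, \\3 = day
usStyle <- sub("^([0-9]{4})-([0-9]{2})-([0-9]{2})", "\\2/\\3/\\1", stamps)

# Extract every run of digits from each string; the result is a list
digitRuns <- regmatches(stamps, gregexpr("[0-9]+", stamps))
```

Elements that do not match are left unchanged by sub() and yield an empty character vector from regmatches().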
To read and write from R, you need to have a firm grasp of where in the computer's filesystem you are reading and writing from.
getwd() # what directory will R look in?
## [1] "/accounts/gen/vis/paciorek/staff/workshops/r-bootcamp-2013/modules"
setwd("~/Desktop/r-bootcamp-2013") # change the working directory
setwd("/Users/paciorek/Desktop/tmp") # absolute path
## Error: cannot change working directory
getwd()
## [1] "/accounts/gen/vis/paciorek/Desktop/r-bootcamp-2013"
setwd("../r-bootcamp-2013") # relative path
Many errors and much confusion result from you and R not being on the same page in terms of where in the directory structure you are.
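One way to reduce this confusion is to check that a file exists before reading it, and to build paths with file.path(). The sketch below uses R's temporary directory and a made-up file name, example.csv, so that it can be run anywhere.

```r
oldDir <- getwd()        # remember where we started
tmpDir <- tempdir()      # a scratch directory that is safe to write to
writeLines("a,b\n1,2", file.path(tmpDir, "example.csv"))

setwd(tmpDir)
foundRelative <- file.exists("example.csv")   # relative path, resolved against getwd()
setwd(oldDir)            # restore the original working directory

# An absolute path works regardless of the current working directory
foundAbsolute <- file.exists(file.path(tmpDir, "example.csv"))
```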
The workhorse for reading into a data frame is read.table(), which allows any separator (CSV, tab-delimited, etc.). read.csv() is a special case of read.table() for CSV files.
rta <- read.table("../data/RTAData.csv", sep = ",", head = TRUE)
rta[1:5, 1:5]
## time X40010 X40015 X40020 X40025
## 1 2010-03-01 14:58 821 209 828 258
## 2 2010-03-01 15:01 804 209 804 248
## 3 2010-03-01 15:04 892 212 801 237
## 4 2010-03-01 15:07 857 214 821 243
## 5 2010-03-01 15:10 849 222 834 252
dim(rta)
## [1] 120822 62
# great, we're all set, right? Not so fast...
unlist(lapply(rta, class))[1:5]
## time X40010 X40015 X40020 X40025
## "factor" "factor" "factor" "factor" "factor"
# ?read.table
rta2 <- read.table("../data/RTAData.csv", sep = ",", head = TRUE, stringsAsFactors = FALSE)
rta2[3, 3]
## [1] "212"
unlist(lapply(rta2, class))[1:5]
## time X40010 X40015 X40020 X40025
## "character" "character" "character" "character" "character"
# let's delve more deeply
levels(rta[, 2])[c(1:5, 3041:3044)]
## [1] "" "1000" "1001" "1002" "1003" "997" "998" "999" "x"
rta3 <- read.table("../data/RTAData.csv", sep = ",", head = TRUE, stringsAsFactors = FALSE,
na.strings = c("NA", "x"))
unlist(lapply(rta3, class))[1:5]
## time X40010 X40015 X40020 X40025
## "character" "integer" "integer" "integer" "integer"
# checking...
missing <- which(rta[, 2] == "")
missing[1:5]
## [1] 1167 1168 1169 1170 1171
rta3[head(missing), ]
## time X40010 X40015 X40020 X40025 X40030 X40035 X40040
## 1167 2010-03-04 01:16 NA NA NA NA NA NA NA
## 1168 2010-03-04 01:19 NA NA NA NA NA NA NA
## 1169 2010-03-04 01:22 NA NA NA NA NA NA NA
## 1170 2010-03-04 01:25 NA NA NA NA NA NA NA
## 1171 2010-03-04 01:28 NA NA NA NA NA NA NA
## 1172 2010-03-04 01:31 NA NA NA NA NA NA NA
## X40045 X40050 X40055 X40060 X40065 X40070 X40075 X40080 X40085 X40090
## 1167 NA NA NA NA NA NA NA NA NA NA
## 1168 NA NA NA NA NA NA NA NA NA NA
## 1169 NA NA NA NA NA NA NA NA NA NA
## 1170 NA NA NA NA NA NA NA NA NA NA
## 1171 NA NA NA NA NA NA NA NA NA NA
## 1172 NA NA NA NA NA NA NA NA NA NA
## X40092 X40095 X40100 X40105 X40110 X40115 X40120 X40125 X40130 X40135
## 1167 NA NA NA NA NA NA NA NA NA NA
## 1168 NA NA NA NA NA NA NA NA NA NA
## 1169 NA NA NA NA NA NA NA NA NA NA
## 1170 NA NA NA NA NA NA NA NA NA NA
## 1171 NA NA NA NA NA NA NA NA NA NA
## 1172 NA NA NA NA NA NA NA NA NA NA
## X40140 X40145 X40150 X41010 X41015 X41020 X41025 X41030 X41035 X41040
## 1167 NA NA NA NA NA NA NA NA NA NA
## 1168 NA NA NA NA NA NA NA NA NA NA
## 1169 NA NA NA NA NA NA NA NA NA NA
## 1170 NA NA NA NA NA NA NA NA NA NA
## 1171 NA NA NA NA NA NA NA NA NA NA
## 1172 NA NA NA NA NA NA NA NA NA NA
## X41045 X41050 X41055 X41060 X41065 X41070 X41075 X41080 X41085 X41090
## 1167 NA NA NA NA NA NA NA NA NA NA
## 1168 NA NA NA NA NA NA NA NA NA NA
## 1169 NA NA NA NA NA NA NA NA NA NA
## 1170 NA NA NA NA NA NA NA NA NA NA
## 1171 NA NA NA NA NA NA NA NA NA NA
## 1172 NA NA NA NA NA NA NA NA NA NA
## X41095 X41100 X41105 X41110 X41115 X41120 X41125 X41130 X41135 X41140
## 1167 NA NA NA NA NA NA NA NA NA NA
## 1168 NA NA NA NA NA NA NA NA NA NA
## 1169 NA NA NA NA NA NA NA NA NA NA
## 1170 NA NA NA NA NA NA NA NA NA NA
## 1171 NA NA NA NA NA NA NA NA NA NA
## 1172 NA NA NA NA NA NA NA NA NA NA
## X41145 X41150 X41155 X41160
## 1167 NA NA NA NA
## 1168 NA NA NA NA
## 1169 NA NA NA NA
## 1170 NA NA NA NA
## 1171 NA NA NA NA
## 1172 NA NA NA NA
It's good to first look at your data in plain text format outside of R and then to check it after you've read it into R.
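The same kind of problem can be reproduced on a tiny scale. In this sketch the data are supplied inline through the text argument of read.csv(); the contents of csvText are invented and merely imitate a file in which "x" and blank fields mark missing values.

```r
csvText <- "time,count
2010-03-01 14:58,821
2010-03-01 15:01,x
2010-03-01 15:04,
2010-03-01 15:07,857"

# A single stray "x" forces the whole column to be read as text
raw <- read.csv(text = csvText, stringsAsFactors = FALSE)
class(raw$count)          # "character"

# Declaring the missing-value codes lets R read the column as numbers
clean <- read.csv(text = csvText, stringsAsFactors = FALSE,
                  na.strings = c("NA", "x", ""))
class(clean$count)        # "integer"
sum(is.na(clean$count))   # 2
```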
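The read.table() family of functions just skims the surface of things...
You can read in a file as a vector of character strings, one string per line of the file, with readLines(), and then post-process it.
You can read fixed-width format files (a constant number of characters in each field) with read.fwf().
We've already seen an example of reading data produced by another statistical package with read.dta(). There are a number of other formats that we can handle for either reading or writing. Let's see library(help = foreign).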
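A minimal sketch of readLines() and read.fwf(), using a temporary file whose contents and column names (state, year, count) are invented for this example:

```r
tmpFile <- tempfile(fileext = ".txt")
writeLines(c("AK2004821", "CA2004209"), tmpFile)

# readLines() returns one character string per line, ready for post-processing
rawLines <- readLines(tmpFile)
nchar(rawLines)                  # 9 9

# read.fwf() splits each line at fixed character widths: 2, then 4, then 3
fixed <- read.fwf(tmpFile, widths = c(2, 4, 3),
                  col.names = c("state", "year", "count"),
                  stringsAsFactors = FALSE)
fixed$count                      # 821 209
```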
R can also read in (and write out) Excel files, netCDF files, HDF5 files, etc., in many cases through add-on packages from CRAN.
A pause for a (gentle) diatribe:
Please try to avoid using Excel files as a data storage format. It's proprietary, complicated (can have multiple sheets), allows a limited number of rows/columns, and files are not easily readable/viewable (unlike simple text files).
For saving plots and tables from R, you have a number of options.
pdf("myplot.pdf", width = 7, height = 7)
x <- rnorm(10)
y <- rnorm(10)
plot(x, y)
dev.off()
## pdf
## 2
xtable() formats tables for HTML and LaTeX (the default).
require(xtable)
## Loading required package: xtable
print(xtable(table(vote$race, vote$pres04)), type = "html")
## <!-- html table generated in R 3.0.1 by xtable 1.7-1 package -->
## <!-- Fri Aug 23 16:59:01 2013 -->
## <TABLE border=1>
## <TR> <TH> </TH> <TH> 0 </TH> <TH> 1 </TH> <TH> 2 </TH> <TH> 3 </TH> <TH> 4 </TH> <TH> 9 </TH> </TR>
## <TR> <TD align="right"> white </TD> <TD align="right"> 111 </TD> <TD align="right"> 26184 </TD> <TD align="right"> 33045 </TD> <TD align="right"> 417 </TD> <TD align="right"> 14 </TD> <TD align="right"> 409 </TD> </TR>
## <TR> <TD align="right"> black </TD> <TD align="right"> 18 </TD> <TD align="right"> 6183 </TD> <TD align="right"> 824 </TD> <TD align="right"> 56 </TD> <TD align="right"> 0 </TD> <TD align="right"> 21 </TD> </TR>
## <TR> <TD align="right"> hispanic/latino </TD> <TD align="right"> 6 </TD> <TD align="right"> 2665 </TD> <TD align="right"> 1639 </TD> <TD align="right"> 34 </TD> <TD align="right"> 3 </TD> <TD align="right"> 49 </TD> </TR>
## <TR> <TD align="right"> asian </TD> <TD align="right"> 0 </TD> <TD align="right"> 626 </TD> <TD align="right"> 384 </TD> <TD align="right"> 7 </TD> <TD align="right"> 1 </TD> <TD align="right"> 2 </TD> </TR>
## <TR> <TD align="right"> other </TD> <TD align="right"> 16 </TD> <TD align="right"> 1036 </TD> <TD align="right"> 653 </TD> <TD align="right"> 22 </TD> <TD align="right"> 0 </TD> <TD align="right"> 32 </TD> </TR>
## </TABLE>
Recall our discussion of the summary() function used on either a vector or a regression object. What do you think is going on with the call to print() above?
Using the voting/presidential preference dataset, create a new column based on age9 that gives, as a numeric value, the midpoint of the age range assigned to each person. Try to do this with a combination of subsetting and string operations (i.e., can you convert the character numbers to actual numbers?). To simplify things, feel free to get rid of the rows for ages "75 and over".
Go back to slide 6 on logical vectors and figure out what is going on in the last few lines of code.
Go back to slide 22 and figure out what is going on with that complicated last line of code.