August 2013, UC Berkeley
Chris Paciorek
For some of the topics here, my goal is not to teach you how to fish, but merely to tell you that fish exist, that they are delicious, that they can be caught, and where one might go to figure out how to catch them.
Confusingly, R has three different systems for OOP, and none of them are as elegant and powerful as in Python or other languages more focused on OOP. That said, they get the job done for a lot of tasks.
The original S3 system underlies lm(), glm(), and many other core features in R's stats package. The basic idea is that coding is structured around objects, which belong to a class, and methods that operate on objects in that class.
Objects are like lists, but with methods that are specifically associated with particular classes, as we've seen with the lm
class.
Objects have fields, analogous to the components of a list. For S4 and reference classes, the fields of the class are fixed and all objects of the class must contain those fields.
library(methods)
yb <- sample(c(0, 1), 10, replace = TRUE)
yc <- rnorm(10)
x <- rnorm(10)
mod1 <- lm(yc ~ x)
mod2 <- glm(yb ~ x, family = binomial)
mod2$residuals # access field with list-like syntax
## 1 2 3 4 5 6 7 8 9 10
## -1.133 -1.113 -1.079 -1.169 2.526 -3.240 -1.074 -1.061 3.240 -1.043
class(mod2)
## [1] "glm" "lm"
is(mod2, "lm")
## [1] TRUE
is.list(mod2)
## [1] TRUE
names(mod2)
## [1] "coefficients" "residuals" "fitted.values"
## [4] "effects" "R" "rank"
## [7] "qr" "family" "linear.predictors"
## [10] "deviance" "aic" "null.deviance"
## [13] "iter" "weights" "prior.weights"
## [16] "df.residual" "df.null" "y"
## [19] "converged" "boundary" "model"
## [22] "call" "formula" "terms"
## [25] "data" "offset" "control"
## [28] "method" "contrasts" "xlevels"
methods(class = "glm")
## [1] add1.glm* anova.glm confint.glm*
## [4] cooks.distance.glm* deviance.glm* drop1.glm*
## [7] effects.glm* extractAIC.glm* family.glm*
## [10] formula.glm* influence.glm* logLik.glm*
## [13] model.frame.glm nobs.glm* predict.glm
## [16] print.glm residuals.glm rstandard.glm
## [19] rstudent.glm summary.glm vcov.glm*
## [22] weights.glm*
##
## Non-visible functions are asterisked
methods(predict)
## [1] predict.ar* predict.Arima*
## [3] predict.arima0* predict.glm
## [5] predict.HoltWinters* predict.lm
## [7] predict.loess* predict.mlm
## [9] predict.nls* predict.poly
## [11] predict.ppr* predict.prcomp*
## [13] predict.princomp* predict.smooth.spline*
## [15] predict.smooth.spline.fit* predict.StructTS*
##
## Non-visible functions are asterisked
predict
## function (object, ...)
## UseMethod("predict")
## <bytecode: 0x11224e8>
## <environment: namespace:stats>
# predict.glm
When predict() is called on a GLM object, R first calls the generic predict(), which recognizes that the first argument is of class glm and dispatches to the appropriate class-specific method, predict.glm() in this case.
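As a quick sanity check (a minimal sketch using the mod2 object fit above), the generic and the class-specific method give the same answer:
# dispatch in action: calling the generic on a glm object is equivalent to
# calling the glm-specific method directly
all.equal(predict(mod2), predict.glm(mod2))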
Making an object and class-specific methods under S3 is simple.
rboot2013 <- list(month = "August", year = 2013, instructor = "Paciorek", attendance = 100)
class(rboot2013) <- "workshop"
rboot2013
## $month
## [1] "August"
##
## $year
## [1] 2013
##
## $instructor
## [1] "Paciorek"
##
## $attendance
## [1] 100
##
## attr(,"class")
## [1] "workshop"
is(rboot2013, "workshop")
## [1] TRUE
rboot2013$instructor
## [1] "Paciorek"
print.workshop <- function(x, ...) {
with(x, cat("A workshop held in ", month, " ", year, "; taught by ", instructor,
".\nThe attendance was ", attendance, ".\n", sep = ""))
invisible(x)
}
rboot2013 # doesn't execute correctly in the slide creation
## $month
## [1] "August"
##
## $year
## [1] 2013
##
## $instructor
## [1] "Paciorek"
##
## $attendance
## [1] 100
##
## attr(,"class")
## [1] "workshop"
Note that we rely on the generic print()
already existing in R. Otherwise we'd need to create it.
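If we wanted a brand-new generic of our own, we'd define it with UseMethod(). Here's a minimal sketch (summarize() is a made-up name, not an existing base R function):
# define the generic, then a method specific to the workshop class
summarize <- function(object, ...) UseMethod("summarize")
summarize.workshop <- function(object, ...) {
    cat("Instructor:", object$instructor, "with attendance of", object$attendance, "\n")
    invisible(object)
}
summarize(rboot2013) # dispatches to summarize.workshop()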
So what is happening behind the scenes here?
Unlike S3, S4 classes have a formal definition and objects in the class must have the specified fields.
The fields of an S4 class are called 'slots'. Instead of x$field
you do x@field
.
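Here's a minimal sketch of defining an S4 class of our own (the class and slot names are made up for illustration):
library(methods)
setClass("workshopS4", representation(instructor = "character",
    attendance = "numeric"))
wsObj <- new("workshopS4", instructor = "Paciorek", attendance = 100)
wsObj@attendance # slots are accessed with @, not $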
Here's a bit of an example of an S4 class for a linear mixed effects model from lme4:
require(lme4)
## Loading required package: lme4
## Loading required package: Matrix
## Loading required package: lattice
## Attaching package: 'lme4'
## The following object is masked from 'package:stats':
##
## AIC, BIC
require(methods)
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
class(fm1)
## [1] "mer"
## attr(,"package")
## [1] "lme4"
methods(class = "mer")
## [1] drop1.mer* extractAIC.mer* isGLMM.mer* isLMM.mer*
## [5] isNLMM.mer* isREML.mer* terms.mer*
##
## Non-visible functions are asterisked
slotNames(fm1)
## [1] "env" "nlmodel" "frame" "call" "flist" "X"
## [7] "Xst" "Zt" "pWt" "offset" "y" "Gp"
## [13] "dims" "ST" "V" "A" "Cm" "Cx"
## [19] "L" "deviance" "fixef" "ranef" "u" "eta"
## [25] "mu" "muEta" "var" "resid" "sqrtXWt" "sqrtrWt"
## [31] "RZX" "RX" "ghx" "ghw"
fm1$ranef
## Error: $ operator not defined for this S4 class
fm1@ranef
## [1] 2.2572 -40.3984 -38.9605 23.6919 22.2613 9.0398 16.8410
## [8] -7.2330 -0.3320 34.8900 -25.2110 -13.0714 4.5784 20.8636
## [15] 3.2754 -25.6144 0.8072 12.3147 9.1993 -8.6211 -5.4502
## [22] -4.8137 -3.0693 -0.2719 -0.2231 1.0744 -10.7524 8.6296
## [29] 1.1727 6.6140 -3.0152 3.5367 0.8723 4.8218 -0.9882
## [36] 1.2844
Reference classes are new in the last few years; for more information, see ?ReferenceClasses
.
Reference classes are somewhat like S4 in that a class is formally defined and R is careful about the fields of the class.
Methods for reference classes can modify the contents of fields in an object invisibly without the need to copy over the entire object.
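Here's a minimal sketch of a reference class (all names made up); note that the addAttendees() method changes the attendance field in place:
workshopR5 <- setRefClass("workshopR5",
    fields = list(instructor = "character", attendance = "numeric"),
    methods = list(
        addAttendees = function(n = 1) {
            # <<- assigns into the object's field without copying the object
            attendance <<- attendance + n
        }))
w <- workshopR5$new(instructor = "Paciorek", attendance = 100)
w$addAttendees(5)
w$attendance # now 105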
When you write your own functions, and particularly when distributing them to others, it's a good idea to check that the arguments are valid and to give informative errors and warnings when they are not.
We can use stop()
and warning()
to do this. They're the same functions that are being called when you see an error message or a warning in reaction to your own work in R.
mysqrt <- function(x) {
if (is.list(x)) {
warning("x is a list; converting to a vector")
x <- unlist(x)
}
if (!is.numeric(x)) {
stop("What is the square root of 'bob'?")
} else {
if (any(x < 0)) {
warning("mysqrt: found negative values; proceeding anyway")
x[x >= 0] <- (x[x >= 0])^(1/2)
x[x < 0] <- NaN
return(x)
} else return(x^(1/2))
}
}
mysqrt(c(1, 2, 3))
## [1] 1.000 1.414 1.732
mysqrt(c(5, -7))
## Warning: mysqrt: found negative values; proceeding anyway
## [1] 2.236 NaN
mysqrt(c("asdf", "sdf"))
## Error: What is the square root of 'bob'?
mysqrt(list(5, 3, "ab"))
## Warning: x is a list; converting to a vector
## Error: What is the square root of 'bob'?
sqrt(c(5, -7))
## Warning: NaNs produced
## [1] 2.236 NaN
sqrt("asdf")
## Error: non-numeric argument to mathematical function
sqrt(list(5, 3, 2))
## Error: non-numeric argument to mathematical function
So we've done something similar to what sqrt()
actually does in R.
When you automate analyses, sometimes an R function call will fail. But you don't want all of your analyses to grind to a halt because one failed. Rather, you want to catch the error, record that it failed, and move on.
For me this is most critical when I'm doing stratified analyses or sequential operations.
The try() function is a powerful tool here.
Suppose we tried to do a stratified analysis of earnings on height within education levels. I'm going to do this as a for loop for pedagogical reasons, but again, it would be better to do this with apply/by/plyr type tools.
mod <- list()
for (edLevel in unique(earnings$ed)) {
print(edLevel)
sub <- subset(earnings, ed == edLevel)
mod[[edLevel]] <- lm(earn ~ height, data = sub)
}
## [1] 12
## [1] 16
## [1] 17
## [1] 18
## [1] 15
## [1] 99
## Error: 0 (non-NA) cases
Now let's redo the loop, wrapping the model fit in try():
mod <- list()
for (edLevel in unique(earnings$ed)) {
    print(edLevel)
    sub <- subset(earnings, ed == edLevel)
    tmp <- try(lm(earn ~ height, data = sub))
    # if the fit errored out, record NA and move on
    if (is(tmp, "try-error")) mod[[edLevel]] <- NA else mod[[edLevel]] <- tmp
}
## [1] 12
## [1] 16
## [1] 17
## [1] 18
## [1] 15
## [1] 99
## [1] 14
## [1] 11
## [1] 10
## [1] 13
## [1] 9
## [1] 4
## [1] 8
## [1] 6
## [1] 5
## [1] 98
## [1] 7
## [1] 2
## [1] 3
mod[[2]]
## [1] NA
mod[[3]]
##
## Call:
## lm(formula = earn ~ height, data = sub)
##
## Coefficients:
## (Intercept) height
## 1400 NA
One of the powerful capabilities you have in R is the ability to use R to modify and create R code.
First we need to understand a bit about how R code is stored and manipulated when we don't want to immediately evaluate it.
When you send some code to R to execute, it has to 'parse' the input; i.e., process it so that it knows how to evaluate it. The parsed input can then be evaluated in the proper context (i.e., the right frame, holding the objects to be operated on and created).
We can capture parsed code before it is evaluated, manipulate it, and execute the modified result.
code <- quote(n <- 100)
code
## n <- 100
class(code)
## [1] "<-"
n
## Error: object 'n' not found
eval(code)
n
## [1] 100
results <- rep(0, n)
moreCode <- quote(for (i in 1:n) {
tmp <- rnorm(30)
results[i] <- min(tmp)
})
class(moreCode)
## [1] "for"
as.list(moreCode)
## [[1]]
## `for`
##
## [[2]]
## i
##
## [[3]]
## 1:n
##
## [[4]]
## {
## tmp <- rnorm(30)
## results[i] <- min(tmp)
## }
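We can also modify the parsed code before evaluating it. A minimal sketch, relying on the component structure shown just above: swap min() for max() in the quoted loop body, then run the modified loop.
# replace min(tmp) with max(tmp) in the quoted code, then evaluate it
moreCode[[4]][[3]][[3]] <- quote(max(tmp))
moreCode
eval(moreCode)
head(results)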
newN <- 200
codeText <- paste("n", "<-", newN)
codeText
## [1] "n <- 200"
codeFromText <- parse(text = codeText)
codeFromText
## expression(n <- 200)
eval(codeFromText)
n
## [1] 200
So you could use R's string manipulation capabilities to write and then evaluate R code. Meta.
Suppose you were given a bunch of objects named "x1", "x2", "x3", ... and you wanted to write code to automatically do some computation on them. Here I'll just demonstrate with three, but this is obviously more compelling if there are many of them.
# assume these objects were provided to you
x1 <- rnorm(5)
x2 <- rgamma(10, 1)
x3 <- runif(20)
# now you want to work with them
nVals <- 3
results <- rep(0, nVals)
for (i in 1:nVals) {
varName <- paste("x", i, sep = "")
tmp <- eval(as.name(varName))
# tmp <- get(varName) # an alternative
results[i] <- mean(tmp)
}
results
## [1] -0.4859 1.0351 0.5120
varName
## [1] "x3"
tmp
## [1] 0.428605 0.602512 0.633760 0.085317 0.747923 0.109254 0.191747
## [8] 0.009666 0.347557 0.299755 0.599203 0.804584 0.703254 0.566804
## [15] 0.990518 0.750419 0.458876 0.640811 0.598365 0.670340
Or suppose you needed to create "x1", "x2", "x3", automatically.
nVals <- 3
results <- rep(0, nVals)
for (i in 1:nVals) {
varName <- paste("x", i, sep = "")
assign(varName, rnorm(10))
}
x2
## [1] -0.07520 2.18524 -0.82398 -0.26709 -1.34836 0.58760 -0.62308
## [8] -0.09177 -1.55000 -0.44705
Can you think of any uses for this ability of R to generate and evaluate its own code?
Text (either a file of regular language or a data file with fields of character strings) will often contain characters that are not part of the limited ASCII character set, which has only 128 characters (including control codes); basically what you see on a standard US keyboard.
UTF-8 is an encoding of the Unicode character set, which includes more than 110,000 characters from 100 different alphabets/scripts. It's widely used on the web.
Latin-1 encodes a small subset of Unicode and contains the characters used in many European languages (e.g., letters with accents).
To read files with other characters correctly into R, you may need to tell R what encoding the file is in. E.g., see help on read.table()
for the fileEncoding and encoding arguments.
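For example (the file name here is hypothetical):
# a hypothetical example: reading a comma-separated file saved in Latin-1
dat <- read.table("survey_latin1.csv", header = TRUE, sep = ",",
    fileEncoding = "latin1")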
With strings already in R, you can convert between encodings with iconv()
:
text <- "_Melhore sua seguran\xe7a_"
iconv(text, from = "latin1", to = "UTF-8")
## [1] "_Melhore sua segurança_"
iconv(text, from = "latin1", to = "ASCII", sub = "???")
## [1] "_Melhore sua seguran???a_"
You can mark a string with an encoding so R can display it correctly:
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
x
## [1] "façile"
# playing around...
x <- "\xa1 \xa2 \xa3 \xf1 \xf2"
Encoding(x) <- "latin1"
x
## [1] "¡ ¢ £ ñ ò"
Windows, Mac, and Linux handle line endings in text files somewhat differently. So if you read a text file into R that was created in a different operating system you can run into difficulties.
In UNIX, lines end in a newline (\n); in Windows, lines end in a newline (\n) and a carriage return (\r). So in UNIX you might see ^M at the end of lines when you open a Windows file in a text editor. The dos2unix or fromdos commands in UNIX can do the necessary conversion.
In Windows you might have a UNIX text file appear to be all one line. The unix2dos or todos commands in UNIX can do the conversion.
There may also be Windows tools to deal with this.
As a side note, before Mac OS X, lines in Mac text files ended only in a carriage return (\r). There is a UNIX utility called mac2unix that can convert such text files.
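You can also control line endings when writing files from R; here's a minimal sketch (the file name is made up) using write.table()'s eol argument:
# write a file with Windows-style CRLF line endings from within R
myData <- data.frame(x = 1:3, y = c("a", "b", "c"))
write.table(myData, file = "forWindows.txt", eol = "\r\n", row.names = FALSE)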
R has the capability to read and write from a variety of relational database management systems (DBMS). Basically a database is a collection of rectangular format datasets (tables). Some of these tables have fields in common so it makes sense to merge (i.e., join) information from multiple tables. E.g., you might have a database with a table of student information, a table of teacher information and a table of school information.
The DBI package provides a front-end for manipulating databases from a variety of DBMSs (MySQL, SQLite, Oracle, among others).
Basically, you tell the package what DBMS is being used on the back-end, link to the actual database, and then you can use the standard functions in the package regardless of the back-end.
The Current Index to Statistics contains records of article and author information from most Statistics journals.
The database is not freely available, so it's not on the course bSpace or Github, but I can do a demo here. (This may not process correctly when producing these slides...)
library(RSQLite) # DBI is a dependency
## Loading required package: DBI
db <- dbConnect(SQLite(), dbname = "../data/cis.db")
# cis.db is an SQLite database
dbListTables(db)
## [1] "articles" "authors" "authorships" "books"
## [5] "contacts" "delayed_jobs" "isbns" "issns"
## [9] "issues" "journals" "tag_relations" "taggings"
## [13] "tags" "volumes"
dbListFields(db, "articles")
## [1] "id" "type" "id_entity" "id_title" "title"
## [6] "year" "volume" "number" "page_start" "page_end"
## [11] "url" "journal" "journal_id" "volume_id" "issue_id"
## [16] "zmath"
breiman <- dbGetQuery(db, "select * from authors\nwhere name like 'Breiman%'")
breiman
## id name
## 1 532 Breiman, Leo\n
## 2 1141 Breiman, L.\n
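Because the tables share id fields, we can join them in a single query. This is only a sketch: the column names in the authorships linking table (author_id, article_id) are assumptions about this database's schema, not something shown above.
# join authors to their articles through the authorships linking table
dbGetQuery(db, "select au.name, ar.title, ar.year
    from authors au
    join authorships ash on ash.author_id = au.id
    join articles ar on ar.id = ash.article_id
    where au.name like 'Breiman%' limit 5")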
Does anyone notice any grammatical issues in this Dilbert strip?
Please fill out the feedback form! https://docs.google.com/spreadsheet/viewform?formkey=dHBaaW5GdVBESWJDUnp6enZNdXpIeUE6MA
Have a snack!