R bootcamp, Module 10: Advanced topics

August 2013, UC Berkeley

Chris Paciorek

This purpose of this module

Note to BB: start recording.

For some of the topics here, my goal is not to teach you how to fish, but merely to tell you that fish exist, they are delicious, that they can be caught, and where one might go to figure out how to catch them.

Object-oriented programming (OOP) in R

Confusingly, R has three different systems for OOP, and none of them are as elegant and powerful as in Python or other languages more focused on OOP. That said, they get the job done for a lot of tasks.

Basics of object-oriented programming (OOP)

The basic idea is that coding is structured around objects, which belong to a class, and methods that operate on objects in the class.

Objects are like lists, but with methods that are specifically associated with particular classes, as we've seen with the lm class.

Objects have fields, analogous to the components of a list. For S4 and reference classes, the fields of the class are fixed and all objects of the class must contain those fields.

Working with S3 classes and methods

library(methods)
yb <- sample(c(0, 1), 10, replace = TRUE)
yc <- rnorm(10)
x <- rnorm(10)
mod1 <- lm(yc ~ x)
mod2 <- glm(yb ~ x, family = binomial)
mod2$residuals  # access field with list-like syntax
##      1      2      3      4      5      6      7      8      9     10 
## -1.133 -1.113 -1.079 -1.169  2.526 -3.240 -1.074 -1.061  3.240 -1.043

class(mod2)
## [1] "glm" "lm"
is(mod2, "lm")
## [1] TRUE
is.list(mod2)
## [1] TRUE
names(mod2)
##  [1] "coefficients"      "residuals"         "fitted.values"    
##  [4] "effects"           "R"                 "rank"             
##  [7] "qr"                "family"            "linear.predictors"
## [10] "deviance"          "aic"               "null.deviance"    
## [13] "iter"              "weights"           "prior.weights"    
## [16] "df.residual"       "df.null"           "y"                
## [19] "converged"         "boundary"          "model"            
## [22] "call"              "formula"           "terms"            
## [25] "data"              "offset"            "control"          
## [28] "method"            "contrasts"         "xlevels"

methods(class = "glm")
##  [1] add1.glm*           anova.glm           confint.glm*       
##  [4] cooks.distance.glm* deviance.glm*       drop1.glm*         
##  [7] effects.glm*        extractAIC.glm*     family.glm*        
## [10] formula.glm*        influence.glm*      logLik.glm*        
## [13] model.frame.glm     nobs.glm*           predict.glm        
## [16] print.glm           residuals.glm       rstandard.glm      
## [19] rstudent.glm        summary.glm         vcov.glm*          
## [22] weights.glm*       
## 
##    Non-visible functions are asterisked

methods(predict)
##  [1] predict.ar*                predict.Arima*            
##  [3] predict.arima0*            predict.glm               
##  [5] predict.HoltWinters*       predict.lm                
##  [7] predict.loess*             predict.mlm               
##  [9] predict.nls*               predict.poly              
## [11] predict.ppr*               predict.prcomp*           
## [13] predict.princomp*          predict.smooth.spline*    
## [15] predict.smooth.spline.fit* predict.StructTS*         
## 
##    Non-visible functions are asterisked

predict
## function (object, ...) 
## UseMethod("predict")
## <bytecode: 0x11224e8>
## <environment: namespace:stats>

# predict.glm

When predict() is called on a GLM object, it first calls the generic predict(), which then recognizes that the first argument is of the class glm and immediately calls the right class-specific method, predict.glm() in this case.

Making your own S3 class/object/method

Making an object and class-specific methods under S3 is simple.

rboot2013 <- list(month = "August", year = 2013, instructor = "Paciorek", attendance = 100)
class(rboot2013) <- "workshop"

rboot2013
## $month
## [1] "August"
## 
## $year
## [1] 2013
## 
## $instructor
## [1] "Paciorek"
## 
## $attendance
## [1] 100
## 
## attr(,"class")
## [1] "workshop"
is(rboot2013, "workshop")
## [1] TRUE
rboot2013$instructor
## [1] "Paciorek"

print.workshop <- function(x) {
    with(x, cat("A workshop held in ", month, " ", year, "; taught by ", instructor, 
        ".\nThe attendance was ", attendance, ".\n", sep = ""))
    invisible(x)
}
rboot2013  # doesn't execute correctly in the slide creation
## $month
## [1] "August"
## 
## $year
## [1] 2013
## 
## $instructor
## [1] "Paciorek"
## 
## $attendance
## [1] 100
## 
## attr(,"class")
## [1] "workshop"

Note that we rely on the generic print() already existing in R. Otherwise we'd need to create it.

So what is happening behind the scenes here?

Using S4 classes and methods

Unlike S4, S4 classes have a formal definition and objects in the class must have the specified fields.

The fields of an S4 class are called 'slots'. Instead of x$field you do x@field.

Here's a bit of an example of an S4 class for a linear mixed effects model from lme4:

require(lme4)
## Loading required package: lme4
## Loading required package: Matrix
## Loading required package: lattice
## Attaching package: 'lme4'
## The following object is masked from 'package:stats':
## 
## AIC, BIC
require(methods)
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
class(fm1)
## [1] "mer"
## attr(,"package")
## [1] "lme4"
methods(class = "mer")
## [1] drop1.mer*      extractAIC.mer* isGLMM.mer*     isLMM.mer*     
## [5] isNLMM.mer*     isREML.mer*     terms.mer*     
## 
##    Non-visible functions are asterisked
slotNames(fm1)
##  [1] "env"      "nlmodel"  "frame"    "call"     "flist"    "X"       
##  [7] "Xst"      "Zt"       "pWt"      "offset"   "y"        "Gp"      
## [13] "dims"     "ST"       "V"        "A"        "Cm"       "Cx"      
## [19] "L"        "deviance" "fixef"    "ranef"    "u"        "eta"     
## [25] "mu"       "muEta"    "var"      "resid"    "sqrtXWt"  "sqrtrWt" 
## [31] "RZX"      "RX"       "ghx"      "ghw"
fm1$ranef
## Error: $ operator not defined for this S4 class
fm1@ranef
##  [1]   2.2572 -40.3984 -38.9605  23.6919  22.2613   9.0398  16.8410
##  [8]  -7.2330  -0.3320  34.8900 -25.2110 -13.0714   4.5784  20.8636
## [15]   3.2754 -25.6144   0.8072  12.3147   9.1993  -8.6211  -5.4502
## [22]  -4.8137  -3.0693  -0.2719  -0.2231   1.0744 -10.7524   8.6296
## [29]   1.1727   6.6140  -3.0152   3.5367   0.8723   4.8218  -0.9882
## [36]   1.2844

A brief mention of Reference Classes

Reference classes are new in the last few years; for more information, see ?ReferenceClasses.

Reference classes are somewhat like S4 in that a class is formally defined and R is careful about the fields of the class.

Methods for reference classes can modify the contents of fields in an object invisibly without the need to copy over the entire object.

Error and warning messages

When you write your own functions, and particularly for distributing to others, it's a good idea to:

We can use stop() and warning() to do this. They're the same functions that are being called when you see an error message or a warning in reaction to your own work in R.

mysqrt <- function(x) {
    if (is.list(x)) {
        warning("x is a list; converting to a vector")
        x <- unlist(x)
    }
    if (!is.numeric(x)) {
        stop("What is the square root of 'bob'?")
    } else {
        if (any(x < 0)) {
            warning("mysqrt: found negative values; proceeding anyway")
            x[x >= 0] <- (x[x >= 0])^(1/2)
            x[x < 0] <- NaN
            return(x)
        } else return(x^(1/2))
    }
}

mysqrt(c(1, 2, 3))
## [1] 1.000 1.414 1.732
mysqrt(c(5, -7))
## Warning: mysqrt: found negative values; proceeding anyway
## [1] 2.236   NaN
mysqrt(c("asdf", "sdf"))
## Error: What is the square root of 'bob'?
mysqrt(list(5, 3, "ab"))
## Warning: x is a list; converting to a vector
## Error: What is the square root of 'bob'?
sqrt(c(5, -7))
## Warning: NaNs produced
## [1] 2.236   NaN
sqrt("asdf")
## Error: non-numeric argument to mathematical function
sqrt(list(5, 3, 2))
## Error: non-numeric argument to mathematical function

So we've done something similar to what sqrt() actually does in R.

'Catching' errors

When you automate analyses, sometimes an R function call will fail. But you don't want all of your analyses to grind to a halt because one failed. Rather, you want to catch the error, record that it failed, and move on.

For me this is most critical when I'm doing stratified analyses or sequential operations.

The try() function is a powerful tool here.

Why we need to try()

Suppose we tried to do a stratified analysis of earnings on height within education levels. I'm going to do this as a for loop for pedagogical reasons, but again, it would be better to do this with apply/by/plyr type tools.

mod <- list()
for (edLevel in unique(earnings$ed)) {
    print(edLevel)
    sub <- subset(earnings, ed == edLevel)
    mod[[edLevel]] <- lm(earn ~ height, data = sub)
}
## [1] 12
## [1] 16
## [1] 17
## [1] 18
## [1] 15
## [1] 99
## Error: 0 (non-NA) cases
  1. What happened?
  2. Why did it go through the education levels in that order? (12, 16, 17, ...)

How we can try() harder

mod <- list()
for (edLevel in unique(earnings$ed)) {
    print(edLevel)
    sub <- subset(earnings, ed == edLevel)
    tmp <- try(lm(earn ~ height, data = sub))
    if (is(tmp, "try-error")) 
        mod[[edLevel]] <- NA else mod[[edLevel]] <- tmp
}
## [1] 12
## [1] 16
## [1] 17
## [1] 18
## [1] 15
## [1] 99
## [1] 14
## [1] 11
## [1] 10
## [1] 13
## [1] 9
## [1] 4
## [1] 8
## [1] 6
## [1] 5
## [1] 98
## [1] 7
## [1] 2
## [1] 3

mod[[2]]
## [1] NA
mod[[3]]
## 
## Call:
## lm(formula = earn ~ height, data = sub)
## 
## Coefficients:
## (Intercept)       height  
##        1400           NA

Computing on the language

One of the powerful capabilities you have in R is the ability to use R to modify and create R code.

First we need to understand a bit about how R code is stored and manipulated when we don't want to immediately evaluate it.

When you send some code to R to execute, it has to 'parse' the input; i.e., to process it so that it know how to evaluate it. The parsed input can then be evaluated in the proper context (i.e., the right frame, holding the objects to be operated on and created).

We can capture parsed code before it is evaluated, manipulate it, and execute the modified result.

Capturing and evaluating parsed code

code <- quote(n <- 100)
code
## n <- 100
class(code)
## [1] "<-"
n
## Error: object 'n' not found

eval(code)
n
## [1] 100

results <- rep(0, n)
moreCode <- quote(for (i in 1:n) {
    tmp <- rnorm(30)
    results[i] <- min(tmp)
})
class(moreCode)
## [1] "for"
as.list(moreCode)
## [[1]]
## `for`
## 
## [[2]]
## i
## 
## [[3]]
## 1:n
## 
## [[4]]
## {
##     tmp <- rnorm(30)
##     results[i] <- min(tmp)
## }

newN <- 200
codeText <- paste("n", "<-", newN)
codeText
## [1] "n <- 200"
codeFromText <- parse(text = codeText)
codeFromText
## expression(n <- 200)
eval(codeFromText)
n
## [1] 200

So you could use R's string manipulation capabilities to write and then evaluate R code. Meta.

Using R to automate working with object names

Suppose you were given a bunch of objects named "x1", "x2", "x3", ... and you wanted to write code to automatically do some computation on them. Here I'll just demonstrate with three, but this is obviously more compelling if there are many of them.

# assume these objects were provided to you
x1 <- rnorm(5)
x2 <- rgamma(10, 1)
x3 <- runif(20)
# now you want to work with them
nVals <- 3
results <- rep(0, nVals)
for (i in 1:nVals) {
    varName <- paste("x", i, sep = "")
    tmp <- eval(as.name(varName))
    # tmp <- get(varName) # an alternative
    results[i] <- mean(tmp)
}
results
## [1] -0.4859  1.0351  0.5120
varName
## [1] "x3"
tmp
##  [1] 0.428605 0.602512 0.633760 0.085317 0.747923 0.109254 0.191747
##  [8] 0.009666 0.347557 0.299755 0.599203 0.804584 0.703254 0.566804
## [15] 0.990518 0.750419 0.458876 0.640811 0.598365 0.670340

Or suppose you needed to create "x1", "x2", "x3", automatically.

nVals <- 3
results <- rep(0, nVals)
for (i in 1:nVals) {
    varName <- paste("x", i, sep = "")
    assign(varName, rnorm(10))
}
x2
##  [1] -0.07520  2.18524 -0.82398 -0.26709 -1.34836  0.58760 -0.62308
##  [8] -0.09177 -1.55000 -0.44705

Can you think of any uses of this ability for R to self-generate?

File encodings

Text (either in the form of a file with regular language in it or a data file with fields of character strings) will often contain characters that are not part of the limited ASCII set of characters, which has 128 characters and control codes; basically what you see on a standard US keyboard.

UTF-8 is an encoding for the Unicode characters that include more than 110,000 characters from 100 different alphabets/scripts. It's widely used on the web.

Latin-1 encodes a small subset of Unicode and contains the characters used in many European languages (e.g., letters with accents).

Dealing with encodings in R

To read files with other characters correctly into R, you may need to tell R what encoding the file is in. E.g., see help on read.table() for the fileEncoding and encoding arguments.

With strings already in R, you can convert between encodings with iconv():

text <- "_Melhore sua seguran\xe7a_"
iconv(text, from = "latin1", to = "UTF-8")
## [1] "_Melhore sua segurança_"
iconv(text, from = "latin1", to = "ASCII", sub = "???")
## [1] "_Melhore sua seguran???a_"

You can mark a string with an encoding so R can display it correctly:

x <- "fa\xE7ile"
Encoding(x) <- "latin1"
x
## [1] "façile"

# playing around...
x <- "\xa1 \xa2 \xa3 \xf1 \xf2"
Encoding(x) <- "latin1"
x
## [1] "¡ ¢ £ ñ ò"

Line endings in text files

Windows, Mac, and Linux handle line endings in text files somewhat differently. So if you read a text file into R that was created in a different operating system you can run into difficulties.

So in UNIX you might see ^M at the end of lines when you open a Windows file in a text editor. The dos2unix or fromdos in UNIX commands can do the necessary conversion

In Windows you might have a UNIX text file appear to be all one line. The unix2dos or todos commands in UNIX can do the conversion.

There may also be Windows tools to deal with this.

As a side note, before Mac OS X, lines may ended only in a carriage return. There is a UNIX utility call mac2unix that can convert such text files.

Working with databases

R has the capability to read and write from a variety of relational database management systems (DBMS). Basically a database is a collection of rectangular format datasets (tables). Some of these tables have fields in common so it makes sense to merge (i.e., join) information from multiple tables. E.g., you might have a database with a table of student information, a table of teacher information and a table of school information.

The DBI package provides a front-end for manipulating databases from a variety of DBMS (MySQL, SQLite, Oracle, among others)

Basically, you tell the package what DBMS is being used on the back-end, link to the actual database, and then you can use the standard functions in the package regardless of the back-end.

Database example

The Current Index to Statistics contains records of article and author information from most Statistics journals.

The database is not freely available, so it's not on the course bSpace or Github, but I can do a demo here. (This may not process correctly when producing these slides...)

library(RSQLite)  # DBI is a dependency
## Loading required package: DBI
db <- dbConnect(SQLite(), dbname = "../data/cis.db")
# cis.db is an SQLite database

dbListTables(db)
##  [1] "articles"      "authors"       "authorships"   "books"        
##  [5] "contacts"      "delayed_jobs"  "isbns"         "issns"        
##  [9] "issues"        "journals"      "tag_relations" "taggings"     
## [13] "tags"          "volumes"
dbListFields(db, "articles")
##  [1] "id"         "type"       "id_entity"  "id_title"   "title"     
##  [6] "year"       "volume"     "number"     "page_start" "page_end"  
## [11] "url"        "journal"    "journal_id" "volume_id"  "issue_id"  
## [16] "zmath"
breiman <- dbGetQuery(db, "select * from authors\nwhere name like 'Breiman%'")
breiman
##     id           name
## 1  532 Breiman, Leo\n
## 2 1141  Breiman, L.\n

Breakout

Does anyone notice any grammatical issues in this Dilbert strip?

Please fill out the feedback form! https://docs.google.com/spreadsheet/viewform?formkey=dHBaaW5GdVBESWJDUnp6enZNdXpIeUE6MA

Have a snack!