Optimization

Note to BB: remember to start recording.

R provides functionality for optimization - finding maxima or minima of a function.

A workhorse is optim(), which implements a number of optimization algorithms.

require(fields)
banana <- function(x) {
    ## Rosenbrock Banana function
    x1 <- x[1]
    x2 <- x[2]
    100 * (x2 - x1 * x1)^2 + (1 - x1)^2
}

x1s <- x2s <- seq(-5, 5, length = 100)
x <- expand.grid(x1s, x2s)
fx <- apply(x, 1, banana)

par(mfrow = c(1, 2), mai = c(0.45, 0.4, 0.1, 0.4))
image.plot(x1s, x2s, matrix(fx, 100), xlab = "", ylab = "")
image.plot(x1s, x2s, matrix(log(fx), 100), xlab = "", ylab = "")

plot of chunk unnamed-chunk-2


optim(c(-2, 0), banana)
## $par
## [1] 1.003 1.006
## 
## $value
## [1] 2.08e-05
## 
## $counts
## function gradient 
##      181       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

We can see the progression of evaluations of the objective function:

banana <- function(x) {
    ## Rosenbrock Banana function
    points(x[1], x[2])
    Sys.sleep(0.03)
    x1 <- x[1]
    x2 <- x[2]
    100 * (x2 - x1 * x1)^2 + (1 - x1)^2
}
par(mfrow = c(1, 1), mai = c(0.45, 0.4, 0.1, 0.4))
image.plot(x1s, x2s, matrix(log(fx), 100), xlab = "", ylab = "")
optim(c(-2, 0), banana)

Smoothing

Linear regression and GLMs are of course useful, but often the relationship is not linear, even on some transformed scale.

Additive models and generalized additive models (GAMs) are the more flexible variants on linear models and GLMs.

There are a variety of tools in R for modeling nonlinear and smooth relationships, mirroring the variety of methods in the literature.

One workhorse is gam() in the mgcv package.

GAM in action

Let's consider height in the earnings dataset.

Any hypotheses about the relationship of earnings with height and education?

library(mgcv)
runif(1)  # not clear why this is needed; knitr throws error w/out it
## [1] 0.2015
mod_male <- gam(earn ~ s(height, k = 10) + s(ed, k = 10), data = earnings[earnings$sex == 
    1, ])
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
mod_female <- gam(earn ~ s(height, k = 10) + s(ed, k = 10), data = earnings[earnings$sex == 
    2, ])
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
summary(mod_male)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## earn ~ s(height, k = 10) + s(ed, k = 10)
## <environment: 0xdb64c8>
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    28927        940    30.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df     F p-value    
## s(height) 1.00   1.00  3.11   0.078 .  
## s(ed)     4.93   5.94 17.04  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.168   Deviance explained = 17.8%
## GCV score = 4.6545e+08  Scale est. = 4.5925e+08  n = 520
summary(mod_female)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## earn ~ s(height, k = 10) + s(ed, k = 10)
## <environment: 0xdb64c8>
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    14620        467    31.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df     F p-value    
## s(height) 1.00   1.00  0.51    0.47    
## s(ed)     2.16   2.73 47.62  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.134   Deviance explained = 13.7%
## GCV score = 1.8821e+08  Scale est. = 1.873e+08  n = 859
par(mfrow = c(1, 2))
plot(mod_male)

plot of chunk gamExample

plot(mod_female)

plot of chunk gamExample

A bit more model-building

Suppose we want to account for race in the model.

mod_male_full <- gam(earn ~ s(height, k = 10) + s(ed, k = 10) + race, data = earnings[earnings$sex == 
    1, ])
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
mod_female_full <- gam(earn ~ s(height, k = 10) + s(ed, k = 10) + race, data = earnings[earnings$sex == 
    2, ])
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
summary(mod_male_full)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## earn ~ s(height, k = 10) + s(ed, k = 10) + race
## <environment: 0xdb64c8>
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    34098       2165   15.75   <2e-16 ***
## race           -4459       1684   -2.65   0.0083 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df    F p-value    
## s(height) 1.00   1.00  2.1    0.15    
## s(ed)     4.81   5.82 17.6  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.177   Deviance explained = 18.8%
## GCV score = 4.6118e+08  Scale est. = 4.5425e+08  n = 520
summary(mod_female_full)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## earn ~ s(height, k = 10) + s(ed, k = 10) + race
## <environment: 0xdb64c8>
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    13748        928   14.81   <2e-16 ***
## race             741        682    1.09     0.28    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df     F p-value    
## s(height) 1.00   1.00  0.64    0.43    
## s(ed)     2.74   3.44 37.69  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.134   Deviance explained = 13.9%
## GCV score = 1.8844e+08  Scale est. = 1.8718e+08  n = 859
par(mfrow = c(1, 2))
plot(mod_male_full)

plot of chunk gamExampleFull

plot(mod_female_full)

plot of chunk gamExampleFull

Comments?

Distributions

Since R was developed by statisticians, it handles distributions and simulation seamlessly.

All commonly-used distributions have functions in R. Each distribution has a family of functions:

Some of the distributions include the following (in the form of their random number generator function): rnorm(), runif(), rbinom(), rpois(), rbeta(), rgamma(), rt().

Distributions in action

pnorm(1.96)
## [1] 0.975
qnorm(0.975)
## [1] 1.96
dbinom(0:10, size = 10, prob = 0.3)
##  [1] 2.825e-02 1.211e-01 2.335e-01 2.668e-01 2.001e-01 1.029e-01 3.676e-02
##  [8] 9.002e-03 1.447e-03 1.378e-04 5.905e-06
dnorm(5)
## [1] 1.487e-06
dt(5, df = 1)
## [1] 0.01224

x <- seq(-5, 5, length = 100)
plot(x, dnorm(x), type = "l")
lines(x, dt(x, df = 1), col = "red")

plot of chunk unnamed-chunk-4

rmultinom(1, 100, prob = c(0.1, 0.1, 0.2, 0.3, 0.25, 0.05))
##      [,1]
## [1,]   12
## [2,]    8
## [3,]   25
## [4,]   26
## [5,]   26
## [6,]    3

x <- seq(0, 10, length = 100)
plot(x, dchisq(x, df = 1), type = "l")
lines(x, dchisq(x, df = 2), col = "red")

plot of chunk unnamed-chunk-5

Other types of simulation and sampling

We can draw a sample with or without replacement. Here's an example of some code that would be part of coding up a bootstrap.

sample(row.names(state.x77), 7, replace = FALSE)
## [1] "Oklahoma"       "Kansas"         "Florida"        "Mississippi"   
## [5] "Missouri"       "Maryland"       "North Carolina"
mean(earnings$earn, na.rm = TRUE)
## [1] 20015
# here's a bootstrap sample:
smp <- sample(seq_len(nrow(earnings)), replace = TRUE)
mean(earnings$earn[smp], na.rm = TRUE)
## [1] 20798

It's a good idea to use seq_along() and seq_len() and not syntax like 1:length(earnings) in sample() because the outcome of length() might in some cases be unexpected (e.g., if you're taking subsets of a dataset). Similar reasoning holds when setting up for loops: e.g.,

for (i in seq_len(nrow(earnings))) {
    # blah
}

The Random Seed

A few key facts about generating random numbers

To replicate any work involving random numbers, make sure to set the seed first.

set.seed(0)
vals <- rnorm(10)
vals
##  [1]  1.262954 -0.326233  1.329799  1.272429  0.414641 -1.539950 -0.928567
##  [8] -0.294720 -0.005767  2.404653
vals <- rnorm(10)
vals
##  [1]  0.7636 -0.7990 -1.1477 -0.2895 -0.2992 -0.4115  0.2522 -0.8919
##  [9]  0.4357 -1.2375
set.seed(0)
vals <- rnorm(10)
vals
##  [1]  1.262954 -0.326233  1.329799  1.272429  0.414641 -1.539950 -0.928567
##  [8] -0.294720 -0.005767  2.404653

Dates

date1 <- as.Date("03-01-2011", format = "%m-%d-%Y")
date2 <- as.Date("03/02/11", format = "%m/%d/%y")
date3 <- as.Date("07-May-11", format = "%d-%b-%y")

date1
## [1] "2011-03-01"
date2
## [1] "2011-03-02"
class(date1)
## [1] "Date"
dates <- c(date1, date2, date3)
weekdays(dates)
## [1] "Tuesday"   "Wednesday" "Saturday"
dates + 30
## [1] "2011-03-31" "2011-04-01" "2011-06-06"
date3 - date2
## Time difference of 66 days
unclass(dates)
## [1] 15034 15035 15101

Time too!

library(chron)
d1 <- chron("12/25/2004", "10:37:59")
# default format of m/d/Y and h:m:s
d2 <- chron("12/26/2004", "11:37:59")

class(d1)
## [1] "chron" "dates" "times"
d1
## [1] (12/25/04 10:37:59)
d1 + 33
## [1] (01/27/05 10:37:59)
d2 - d1
## Time in days:
## [1] 1.042
d1 + d2
## Error: chron objects may not be added together

There's lots more packages/functionality for dates/times: see lubridate and ?DateTimeClasses

Breakout

  1. Suppose you wanted to do 10-fold cross-validation for some sort of regression model fit to the earnings dataset. Write some R code that produces a field in the dataset that indicates which fold each observation is in. Ensure each of the folds has an equal (or as nearly equal as possible if the number of observations is not divisible by 10) number of observations. Hint: consider the times argument to the rep() function.
  2. Write some code to demonstrate the central limit theorem. Generate many different replicates of samples of size n from a skewed or discrete distribution and show that if n is big enough, the distribution of the means (of each sample of size n) looks approximately normal in a histogram. Do it without any looping!

I.e., I want you to show that if you have a large number (say 10,000) of means, each mean being the mean of n values from a distribution, the distribution of the means looks approximately normal if n is sufficiently big.