R provides functionality for optimization: finding maxima or minima of a function. A workhorse is optim(), which implements a number of optimization algorithms.
library(fields)  # provides image.plot()
banana <- function(x) {
    ## Rosenbrock banana function
    x1 <- x[1]
    x2 <- x[2]
    100 * (x2 - x1 * x1)^2 + (1 - x1)^2
}
x1s <- x2s <- seq(-5, 5, length = 100)
x <- expand.grid(x1s, x2s)  # grid of (x1, x2) points
fx <- apply(x, 1, banana)   # evaluate the function over the grid
par(mfrow = c(1, 2), mai = c(0.45, 0.4, 0.1, 0.4))
image.plot(x1s, x2s, matrix(fx, 100), xlab = "", ylab = "")
image.plot(x1s, x2s, matrix(log(fx), 100), xlab = "", ylab = "")  # log scale shows the curved valley
optim(c(-2, 0), banana)
## $par
## [1] 1.003 1.006
##
## $value
## [1] 2.08e-05
##
## $counts
## function gradient
## 181 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
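By default optim() uses the Nelder-Mead simplex method, which requires no derivatives. If we can supply an analytic gradient, a quasi-Newton method such as BFGS will typically need far fewer function evaluations; a minimal sketch (banana_grad is my own helper, not part of the original code):
banana_grad <- function(x) {
    ## gradient of the Rosenbrock banana function
    x1 <- x[1]
    x2 <- x[2]
    c(-400 * x1 * (x2 - x1 * x1) - 2 * (1 - x1),
      200 * (x2 - x1 * x1))
}
optim(c(-2, 0), banana, gr = banana_grad, method = "BFGS")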
We can see the progression of evaluations of the objective function:
banana <- function(x) {
    ## Rosenbrock banana function, now plotting each point where it is evaluated
    points(x[1], x[2])
    Sys.sleep(0.03)  # pause briefly so we can watch the search unfold
    x1 <- x[1]
    x2 <- x[2]
    100 * (x2 - x1 * x1)^2 + (1 - x1)^2
}
par(mfrow = c(1, 1), mai = c(0.45, 0.4, 0.1, 0.4))
image.plot(x1s, x2s, matrix(log(fx), 100), xlab = "", ylab = "")
optim(c(-2, 0), banana)
Linear regression and GLMs are of course useful, but often the relationship is not linear, even on some transformed scale.
Additive models and generalized additive models (GAMs) are more flexible variants of linear models and GLMs.
There are a variety of tools in R for modeling nonlinear and smooth relationships, mirroring the variety of methods in the literature.
One workhorse is gam() in the mgcv package.
Let's consider height in the earnings dataset.
Any hypotheses about the relationship of earnings with height and education?
library(mgcv)
mod_male <- gam(earn ~ s(height, k = 10) + s(ed, k = 10), data = earnings[earnings$sex == 1, ])
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
mod_female <- gam(earn ~ s(height, k = 10) + s(ed, k = 10), data = earnings[earnings$sex == 2, ])
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
summary(mod_male)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## earn ~ s(height, k = 10) + s(ed, k = 10)
## <environment: 0xdb64c8>
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28927 940 30.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(height) 1.00 1.00 3.11 0.078 .
## s(ed) 4.93 5.94 17.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.168 Deviance explained = 17.8%
## GCV score = 4.6545e+08 Scale est. = 4.5925e+08 n = 520
summary(mod_female)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## earn ~ s(height, k = 10) + s(ed, k = 10)
## <environment: 0xdb64c8>
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14620 467 31.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(height) 1.00 1.00 0.51 0.47
## s(ed) 2.16 2.73 47.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.134 Deviance explained = 13.7%
## GCV score = 1.8821e+08 Scale est. = 1.873e+08 n = 859
par(mfrow = c(1, 2))
plot(mod_male)
plot(mod_female)
Suppose we want to account for race in the model.
mod_male_full <- gam(earn ~ s(height, k = 10) + s(ed, k = 10) + race, data = earnings[earnings$sex == 1, ])
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
mod_female_full <- gam(earn ~ s(height, k = 10) + s(ed, k = 10) + race, data = earnings[earnings$sex == 2, ])
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
## Warning: the matrix is either rank-deficient or indefinite
summary(mod_male_full)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## earn ~ s(height, k = 10) + s(ed, k = 10) + race
## <environment: 0xdb64c8>
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34098 2165 15.75 <2e-16 ***
## race -4459 1684 -2.65 0.0083 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(height) 1.00 1.00 2.1 0.15
## s(ed) 4.81 5.82 17.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.177 Deviance explained = 18.8%
## GCV score = 4.6118e+08 Scale est. = 4.5425e+08 n = 520
summary(mod_female_full)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## earn ~ s(height, k = 10) + s(ed, k = 10) + race
## <environment: 0xdb64c8>
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13748 928 14.81 <2e-16 ***
## race 741 682 1.09 0.28
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(height) 1.00 1.00 0.64 0.43
## s(ed) 2.74 3.44 37.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.134 Deviance explained = 13.9%
## GCV score = 1.8844e+08 Scale est. = 1.8718e+08 n = 859
par(mfrow = c(1, 2))
plot(mod_male_full)
plot(mod_female_full)
Comments?
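One simple follow-up (my own, not part of the original code) is to compare the models with and without race by AIC:
AIC(mod_male, mod_male_full)
AIC(mod_female, mod_female_full)
Note also that race enters these models as a single numeric slope; if it is really a categorical code, wrapping it in factor() would usually be more appropriate.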
Since R was developed by statisticians, it handles distributions and simulation seamlessly.
All commonly-used distributions have functions in R. Each distribution has a family of functions, named with a common prefix:
dnorm() (density)
rnorm() (random number generation)
pnorm() (cumulative distribution function)
qnorm() (quantile function, i.e., inverse CDF)
Some of the available distributions, in the form of their random number generator functions: rnorm(), runif(), rbinom(), rpois(), rbeta(), rgamma(), rt().
pnorm(1.96)
## [1] 0.975
qnorm(0.975)
## [1] 1.96
dbinom(0:10, size = 10, prob = 0.3)
## [1] 2.825e-02 1.211e-01 2.335e-01 2.668e-01 2.001e-01 1.029e-01 3.676e-02
## [8] 9.002e-03 1.447e-03 1.378e-04 5.905e-06
dnorm(5)
## [1] 1.487e-06
dt(5, df = 1)
## [1] 0.01224
x <- seq(-5, 5, length = 100)
plot(x, dnorm(x), type = "l")  # standard normal density
lines(x, dt(x, df = 1), col = "red")  # t with 1 df (Cauchy): much heavier tails
rmultinom(1, 100, prob = c(0.1, 0.1, 0.2, 0.3, 0.25, 0.05))
## [,1]
## [1,] 12
## [2,] 8
## [3,] 25
## [4,] 26
## [5,] 26
## [6,] 3
x <- seq(0, 10, length = 100)
plot(x, dchisq(x, df = 1), type = "l")
lines(x, dchisq(x, df = 2), col = "red")
We can draw a sample with or without replacement. Here's an example of some code that would be part of coding up a bootstrap.
sample(row.names(state.x77), 7, replace = FALSE)
## [1] "Oklahoma" "Kansas" "Florida" "Mississippi"
## [5] "Missouri" "Maryland" "North Carolina"
mean(earnings$earn, na.rm = TRUE)
## [1] 20015
# here's a bootstrap sample:
smp <- sample(seq_len(nrow(earnings)), replace = TRUE)
mean(earnings$earn[smp], na.rm = TRUE)
## [1] 20798
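Repeating that resampling many times gives the bootstrap distribution of the mean; a minimal sketch (the 1,000 replicates and the sapply() formulation are my choices, not from the original notes):
n_boot <- 1000
boot_means <- sapply(seq_len(n_boot), function(i) {
    smp <- sample(seq_len(nrow(earnings)), replace = TRUE)
    mean(earnings$earn[smp], na.rm = TRUE)
})
sd(boot_means)  # bootstrap estimate of the standard error of the mean
hist(boot_means)  # bootstrap distribution of the sample mean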
It's a good idea to use seq_along() and seq_len() rather than syntax like 1:length(earnings) in sample(), because the outcome of length() might in some cases be unexpected (e.g., if you're taking subsets of a dataset and a subset turns out to be empty, 1:0 gives c(1, 0) rather than an empty vector, as illustrated below). Similar reasoning holds when setting up for loops, e.g.,
for (i in seq_len(nrow(earnings))) {
# blah
}
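To see the zero-length pitfall concretely (a quick illustration of my own):
x <- numeric(0)
1:length(x)  # c(1, 0) -- a loop over this runs twice!
seq_len(length(x))  # integer(0) -- a loop over this correctly runs zero times
seq_along(x)  # same result, more concise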
A few key facts about generating random numbers:
To replicate any work involving random numbers, make sure to set the seed first. Re-setting the same seed reproduces exactly the same sequence of draws, as below.
set.seed(0)
vals <- rnorm(10)
vals
## [1] 1.262954 -0.326233 1.329799 1.272429 0.414641 -1.539950 -0.928567
## [8] -0.294720 -0.005767 2.404653
vals <- rnorm(10)
vals
## [1] 0.7636 -0.7990 -1.1477 -0.2895 -0.2992 -0.4115 0.2522 -0.8919
## [9] 0.4357 -1.2375
set.seed(0)
vals <- rnorm(10)
vals
## [1] 1.262954 -0.326233 1.329799 1.272429 0.414641 -1.539950 -0.928567
## [8] -0.294720 -0.005767 2.404653
date1 <- as.Date("03-01-2011", format = "%m-%d-%Y")
date2 <- as.Date("03/02/11", format = "%m/%d/%y")
date3 <- as.Date("07-May-11", format = "%d-%b-%y")
date1
## [1] "2011-03-01"
date2
## [1] "2011-03-02"
class(date1)
## [1] "Date"
dates <- c(date1, date2, date3)
weekdays(dates)
## [1] "Tuesday" "Wednesday" "Saturday"
dates + 30
## [1] "2011-03-31" "2011-04-01" "2011-06-06"
date3 - date2
## Time difference of 66 days
unclass(dates)
## [1] 15034 15035 15101
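Going the other direction, format() turns Date objects back into character strings in whatever layout you like (the format strings here are just my own illustration):
format(date1, "%d %B %Y")  # "01 March 2011" (month name depends on locale)
format(dates, "%m/%d/%y")  # "03/01/11" "03/02/11" "05/07/11"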
library(chron)
d1 <- chron("12/25/2004", "10:37:59")  # default formats are m/d/y and h:m:s
d2 <- chron("12/26/2004", "11:37:59")
class(d1)
## [1] "chron" "dates" "times"
d1
## [1] (12/25/04 10:37:59)
d1 + 33
## [1] (01/27/05 10:37:59)
d2 - d1
## Time in days:
## [1] 1.042
d1 + d2
## Error: chron objects may not be added together
There's lots more functionality for dates and times in other packages: see the lubridate package and ?DateTimeClasses.
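A tiny taste of lubridate (assuming the package is installed; these particular calls are just my illustration):
library(lubridate)
d <- ymd_hms("2011-03-01 10:37:59")
month(d)  # 3
wday(d, label = TRUE)  # "Tue"
d + days(30)  # easy date arithmetic: "2011-03-31 10:37:59 UTC"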
Exercise: draw a large number of samples of size n from a skewed or discrete distribution and show that if n is big enough, the distribution of the means (of each sample of size n) looks approximately normal in a histogram. Do it without any looping! I.e., I want you to show that if you have a large number (say 10,000) of means, each mean being the mean of n values from the distribution, the distribution of the means looks approximately normal if n is sufficiently big. (The rep() function may come in handy.)