R Bootcamp, Module 6: Useful Stuff

August 2013, UC Berkeley

Chris Krogslund (ckrogslund@berkeley.edu)

Now you've got all the basic pieces

Note to BB: remember to start recording.

The Answer

R is great at doing useful stuff and making it very, very easy

Useful Stuff (abridged, obviously)

An Example Dataset

Replication data from Gelman et al., "Rich State, Poor State, Red State, Blue State: What’s the Matter with Connecticut?", Quarterly Journal of Political Science, 2007, 2: 345-367

For decades, the Democrats have been viewed as the party of the poor, with the Republicans representing the rich. Recent presidential elections, however, have shown a reverse pattern, with Democrats performing well in the richer blue states in the northeast and coasts, and Republicans dominating in the red states in the middle of the country and the south. Through multilevel modeling of individual- level survey data and county- and state-level demographic and electoral data, we reconcile these patterns. Furthermore, we find that income matters more in red America than in blue America. In poor states, rich people are much more likely than poor people to vote for the Republican presidential candidate, but in rich states (such as Connecticut), income has a very low correlation with vote preference.

First Steps

Let's see what we're dealing with here.

str(mydata)
## 'data.frame':    76205 obs. of  17 variables:
##  $ state   : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ pres04  : int  1 2 1 1 1 1 1 2 2 2 ...
##  $ sex     : Factor w/ 2 levels "male","female": 2 1 2 2 2 2 1 2 2 2 ...
##  $ race    : Factor w/ 5 levels "white","black",..: 1 1 2 2 1 1 1 1 1 1 ...
##  $ age9    : Factor w/ 9 levels "18-24","25-29",..: 2 1 3 3 4 3 4 1 2 1 ...
##  $ partyid : Factor w/ 4 levels "democrat","republican",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ income  : Factor w/ 8 levels "under $15,000",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ relign8 : Factor w/ 8 levels "protestant","catholic",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ age60   : Factor w/ 4 levels "18-29","30-44",..: 1 1 2 2 2 2 2 1 1 1 ...
##  $ age65   : Factor w/ 6 levels "18-24","25-29",..: 2 1 3 3 4 3 4 1 2 1 ...
##  $ geocode : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ sizeplac: Factor w/ 5 levels "city over 500,000",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ brnagain: Factor w/ 2 levels "yes","no": NA NA NA NA NA NA NA NA NA NA ...
##  $ attend  : Factor w/ 6 levels "more than once a week",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ year    : num  2004 2004 2004 2004 2004 ...
##  $ region  : num  4 4 4 4 4 4 4 4 4 4 ...
##  $ y       : num  0 1 0 0 0 0 0 1 1 1 ...
##  - attr(*, "datalabel")= chr ""
##  - attr(*, "time.stamp")= chr " 6 Jun 2007 14:53"
##  - attr(*, "formats")= chr  "%8.0g" "%8.0g" "%8.0g" "%8.0g" ...
##  - attr(*, "types")= int  251 251 251 251 251 251 251 251 251 251 ...
##  - attr(*, "val.labels")= chr  "stanum" "presak04" "sex" "race" ...
##  - attr(*, "var.labels")= chr  "state id" "in today's election for president, did you just vote for:" "are you:" "are you:" ...
##  - attr(*, "version")= int 8
##  - attr(*, "label.table")=List of 14
##   ..$ stanum  : Named num 2
##   .. ..- attr(*, "names")= chr "alaska"
##   ..$ presak04: Named num  0 1 2 3 9
##   .. ..- attr(*, "names")= chr  "did not vote" "kerry" "bush" "nader" ...
##   ..$ sex     : Named num  1 2
##   .. ..- attr(*, "names")= chr  "male" "female"
##   ..$ race    : Named num  1 2 3 4 5
##   .. ..- attr(*, "names")= chr  "white" "black" "hispanic/latino" "asian" ...
##   ..$ age9    : Named num  1 2 3 4 5 6 7 8 9
##   .. ..- attr(*, "names")= chr  "18-24" "25-29" "30-39" "40-44" ...
##   ..$ partyid : Named num  1 2 3 4
##   .. ..- attr(*, "names")= chr  "democrat" "republican" "independent" "something else"
##   ..$ income  : Named num  1 2 3 4 5 6 7 8
##   .. ..- attr(*, "names")= chr  "under $15,000" "$15,000-$29,999" "$30,000-$49,999" "$50,000-$74,999" ...
##   ..$ relign8 : Named num  1 2 3 4 5 6 7 8
##   .. ..- attr(*, "names")= chr  "protestant" "catholic" "mormon/lds" "other christian" ...
##   ..$ age60   : Named num  1 2 3 4
##   .. ..- attr(*, "names")= chr  "18-29" "30-44" "45-59" "60 or over"
##   ..$ age65   : Named num  1 2 3 4 5 6
##   .. ..- attr(*, "names")= chr  "18-24" "25-29" "30-39" "40-49" ...
##   ..$ geocode : Named num  1 2 3
##   .. ..- attr(*, "names")= chr  "juneau/fairbanks/rural" "anchorage" "anchorage-fairbanks corridor"
##   ..$ sizeplac: Named num  1 2 3 4 5
##   .. ..- attr(*, "names")= chr  "city over 500,000" "city: 50,000 to 500,000" "suburbs" "city: 10,000 to 49,999" ...
##   ..$ brnagain: Named num  1 2
##   .. ..- attr(*, "names")= chr  "yes" "no"
##   ..$ attend  : Named num  1 2 3 4 5 9
##   .. ..- attr(*, "names")= chr  "more than once a week" "once a week" "a few times a month" "a few times a year" ...

This is a bit messy, but it's a decently comprehensive overview.

First Steps

First Steps

Take a look at the variable "age9"

unique(mydata$age9)
##  [1] 25-29      18-24      30-39      40-44      45-49      65-74     
##  [7] 75 or over 50-59      <NA>       60-64     
## Levels: 18-24 25-29 30-39 40-44 45-49 50-59 60-64 65-74 75 or over

Binned Age | True Age ------ | ----- 1|18-24 2|25-29 3|30-39 4|40-44 5|45-49 6|50-59 7|60-64 8|65-74 9|75+

Let's look at the mean, standard deviation, median, and quintiles for "age9"

mean(x = as.numeric(mydata$age9), na.rm = T)
## [1] 4.544
sd(x = as.numeric(mydata$age9), na.rm = T)
## [1] 2.238
median(x = as.numeric(mydata$age9), na.rm = T)
## [1] 4
quantile(x = as.numeric(mydata$age9), na.rm = T, probs = seq(from = 0, to = 1, 
    by = 0.2))
##   0%  20%  40%  60%  80% 100% 
##    1    3    4    5    6    9

First Steps

We could do the same thing for "income"

unique(mydata$income)
## [1] <NA>            $30,000-$49,999 $15,000-$29,999 $50,000-$74,999
## [5] under $15,000   $75,000-$99,999
## 8 Levels: under $15,000 $15,000-$29,999 ... $200,000 or more

Binned Income | True Income ------ | ----- 1|$0-$14,999 2|$15,000-$29,999 3|$30,000-$49,999 4|$50,000-$74,999 5|$75,000-$99,999 6|$100,000-$149,999 7|$150,000-$199,999 8|$200,000- ∞

mean(x = as.numeric(mydata$income), na.rm = T)
## [1] 2.637
sd(x = as.numeric(mydata$income), na.rm = T)
## [1] 1.206
median(x = as.numeric(mydata$income), na.rm = T)
## [1] 3
quantile(x = as.numeric(mydata$income), na.rm = T, probs = seq(from = 0, to = 1, 
    by = 0.2))
##   0%  20%  40%  60%  80% 100% 
##    1    1    2    3    4    5

First Steps

For example...

Gender balance?

gender <- mydata$sex
gender <- gender[!is.na(gender)]
unique(gender)
## [1] female male  
## Levels: male female
males <- length(gender[gender == "male"])/length(gender) * 100
cat(males, "%", sep = "")
## 45.13%
females <- length(gender[gender == "female"])/length(gender) * 100
cat(females, "%", sep = "")
## 54.87%

For example...

Political balance?

partyid <- mydata$partyid
partyid <- partyid[!is.na(partyid)]
unique(partyid)
## [1] republican     democrat       independent    something else
## Levels: democrat republican independent something else
dem <- length(partyid[partyid == "democrat"])/length(partyid) * 100
cat(dem, "%", sep = "")
## 37.11%
rep <- length(partyid[partyid == "republican"])/length(partyid) * 100
cat(rep, "%", sep = "")
## 35.57%
ind <- length(partyid[partyid == "independent"])/length(partyid) * 100
cat(ind, "%", sep = "")
## 22.64%
other <- length(partyid[partyid == "something else"])/length(partyid) * 100
cat(other, "%", sep = "")
## 4.675%

For example...

Racial balance?

race <- mydata$race
race <- race[!is.na(race)]
unique(race)
## [1] white           black           other           asian          
## [5] hispanic/latino
## Levels: white black hispanic/latino asian other
white <- length(race[race == "white"])/length(race) * 100
cat(white, "%", sep = "")
## 80.79%
black <- length(race[race == "black"])/length(race) * 100
cat(black, "%", sep = "")
## 9.558%
hispanic <- length(race[race == "hispanic/latino"])/length(race) * 100
cat(hispanic, "%", sep = "")
## 5.909%
asian <- length(race[race == "asian"])/length(race) * 100
cat(asian, "%", sep = "")
## 1.371%
other <- length(race[race == "other"])/length(race) * 100
cat(other, "%", sep = "")
## 2.376%

This can get really complicated really quickly

How to tackle these tabulations?

All techniques for this problem rely on the split-apply-combine strategy

First, take the data (or some object) and split it into smaller datasets on the basis of some variable

Dataset A

x|y|z -----|------|----- 1|1|1 2|2|1 3|3|1 4|1|2 5|2|2 6|3|2

Datasets B and C (Dataset A split according to "z")

x|y|z| | | | | |x|y|z -----|------|-----|-----|-----|-----|-----|-----|-----|-----|----- 1|1|1| | | | | |4|1|2 2|2|1| | | | | |5|2|2 3|3|1| || | | |6|3|2

How to tackle these tabulations?

Second, apply some function to each one of the smaller datasets/objects

Example function: mean of variables "x" and "y"

Datasets B' and C'

mean(x)|mean(y)|z| | | | | |mean(x)|mean(y)|z -----|------|-----|-----|-----|-----|-----|-----|-----|-----|----- 2|2|1| | | | | |5|2|2

How to tackle these tabulations?

Third, combine the results into a larger dataset/object

Datasets B' and C'

mean(x)|mean(y)|z| | | | | |mean(x)|mean(y)|z -----|------|-----|-----|-----|-----|-----|-----|-----|-----|----- 2|2|1| | | | | |5|2|2

Dataset A'

mean(x)|mean(y)|z -----|------|----- 2|2|1 5|2|2

Tabulating the hard way

Split

white.sub <- mydata[which(mydata$race == "white" & is.na(mydata$race) == F), 
    ]

black.sub <- mydata[which(mydata$race == "black" & is.na(mydata$race) == F), 
    ]

hispanic.sub <- mydata[which(mydata$race == "hispanic/latino" & is.na(mydata$race) == 
    F), ]

Apply

white.pres04 <- white.sub$pres04
white.pres04 <- white.pres04[is.element(white.pres04, c(1:3, 9)) == T]

white.k <- length(white.pres04[white.pres04 == 1])/length(white.pres04) * 100
white.b <- length(white.pres04[white.pres04 == 2])/length(white.pres04) * 100
white.n <- length(white.pres04[white.pres04 == 3])/length(white.pres04) * 100
white.o <- length(white.pres04[white.pres04 == 9])/length(white.pres04) * 100

black.pres04 <- black.sub$pres04
black.pres04 <- black.pres04[is.element(black.pres04, c(1:3, 9)) == T]

black.k <- length(black.pres04[black.pres04 == 1])/length(black.pres04) * 100
black.b <- length(black.pres04[black.pres04 == 2])/length(black.pres04) * 100
black.n <- length(black.pres04[black.pres04 == 3])/length(black.pres04) * 100
black.o <- length(black.pres04[black.pres04 == 9])/length(black.pres04) * 100

. . .

Combine

percent <- as.vector(c(white.k, white.b, white.n, white.o, black.k, black.b, 
    black.n, black.o))
vote <- rep(c("kerry", "bush", "nader", "other"), 2)
race <- c(rep("white", 4), rep("black", 4))
newdata <- data.frame(race, vote, percent)
print(newdata)
##    race  vote percent
## 1 white kerry 43.6000
## 2 white  bush 55.0246
## 3 white nader  0.6944
## 4 white other  0.6810
## 5 black kerry 87.2812
## 6 black  bush 11.6318
## 7 black nader  0.7905
## 8 black other  0.2964

. . .

Tabulating the less-hard way

Split

races = split(mydata$pres04, mydata$race)

Apply/Combine

results <- matrix(NA, nrow = length(races), ncol = length(c(1:3, 9)))
rownames(results) <- objects(races)
colnames(results) <- c("kerry", "bush", "nader", "other")
print(results)
##                 kerry bush nader other
## asian              NA   NA    NA    NA
## black              NA   NA    NA    NA
## hispanic/latino    NA   NA    NA    NA
## other              NA   NA    NA    NA
## white              NA   NA    NA    NA
for (i in 1:length(races)) {
    race.subset <- races[[i]]
    race.subset <- race.subset[is.element(race.subset, c(1:3, 9)) == T]
    results[i, "kerry"] <- length(race.subset[race.subset == 1])/length(race.subset) * 
        100
    results[i, "bush"] <- length(race.subset[race.subset == 2])/length(race.subset) * 
        100
    results[i, "nader"] <- length(race.subset[race.subset == 3])/length(race.subset) * 
        100
    results[i, "other"] <- length(race.subset[race.subset == 9])/length(race.subset) * 
        100
}
print(results)
##                 kerry  bush  nader  other
## asian           43.60 55.02 0.6944 0.6810
## black           87.28 11.63 0.7905 0.2964
## hispanic/latino 60.75 37.36 0.7750 1.1169
## other           61.43 37.68 0.6869 0.1963
## white           59.44 37.46 1.2622 1.8359

Tabulating the easy way

cleandat <- mydata[which(is.element(mydata$pres04, c(1:3, 9)) == T & is.na(mydata$race) == 
    F), ]

vote.by.race <- ddply(.data = cleandat, .variables = .(race), summarize, kerry = length(pres04[pres04 == 
    1])/length(pres04) * 100, bush = length(pres04[pres04 == 2])/length(pres04) * 
    100, nader = length(pres04[pres04 == 3])/length(pres04) * 100, other = length(pres04[pres04 == 
    9])/length(pres04) * 100)

print(vote.by.race)
##              race kerry  bush  nader  other
## 1           white 43.60 55.02 0.6944 0.6810
## 2           black 87.28 11.63 0.7905 0.2964
## 3 hispanic/latino 60.75 37.36 0.7750 1.1169
## 4           asian 61.43 37.68 0.6869 0.1963
## 5           other 59.44 37.46 1.2622 1.8359

plyr

plyr (Hadley Wickham, Rice) is the go-to package for all your splitting-applying-combining needs

Among its many benefits (above base R capabilities):

Using plyr: how to select functions

Two questions:

  1. What is the class of your input object?

  2. What is the class of your desired output object?

Using plyr: how to write commands

All of the major plyr functions have the same basic syntax

xxply(.data = , .variables = .(, , , ...), .fun = )

Consider the previous example using ddply:

vote.by.race <- ddply(.data = cleandat, .variables = .(race), summarize, kerry = length(pres04[pres04 == 
    1])/length(pres04) * 100, bush = length(pres04[pres04 == 2])/length(pres04) * 
    100, nader = length(pres04[pres04 == 3])/length(pres04) * 100, other = length(pres04[pres04 == 
    9])/length(pres04) * 100)

Using plyr: dlply

vote.by.race.dl <- dlply(.data = cleandat, .variables = .(race), summarize, 
    kerry = length(pres04[pres04 == 1])/length(pres04) * 100, bush = length(pres04[pres04 == 
        2])/length(pres04) * 100, nader = length(pres04[pres04 == 3])/length(pres04) * 
        100, other = length(pres04[pres04 == 9])/length(pres04) * 100)

print(vote.by.race.dl)
## $white
##   kerry  bush  nader other
## 1  43.6 55.02 0.6944 0.681
## 
## $black
##   kerry  bush  nader  other
## 1 87.28 11.63 0.7905 0.2964
## 
## $`hispanic/latino`
##   kerry  bush nader other
## 1 60.75 37.36 0.775 1.117
## 
## $asian
##   kerry  bush  nader  other
## 1 61.43 37.68 0.6869 0.1963
## 
## $other
##   kerry  bush nader other
## 1 59.44 37.46 1.262 1.836
## 
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
##              race
## 1           white
## 2           black
## 3 hispanic/latino
## 4           asian
## 5           other

Using plyr: ldply

races = split(cleandat, cleandat$race)

objects(races)
## [1] "asian"           "black"           "hispanic/latino" "other"          
## [5] "white"
vote.by.race.ld <- ldply(.data = races, summarize, kerry = length(pres04[pres04 == 
    1])/length(pres04) * 100, bush = length(pres04[pres04 == 2])/length(pres04) * 
    100, nader = length(pres04[pres04 == 3])/length(pres04) * 100, other = length(pres04[pres04 == 
    9])/length(pres04) * 100)

print(vote.by.race.ld)
##               .id kerry  bush  nader  other
## 1           white 43.60 55.02 0.6944 0.6810
## 2           black 87.28 11.63 0.7905 0.2964
## 3 hispanic/latino 60.75 37.36 0.7750 1.1169
## 4           asian 61.43 37.68 0.6869 0.1963
## 5           other 59.44 37.46 1.2622 1.8359

Using plyr: llply

vote.by.race.ll <- llply(.data = races, summarize, kerry = length(pres04[pres04 == 
    1])/length(pres04) * 100, bush = length(pres04[pres04 == 2])/length(pres04) * 
    100, nader = length(pres04[pres04 == 3])/length(pres04) * 100, other = length(pres04[pres04 == 
    9])/length(pres04) * 100)

print(vote.by.race.ll)
## $white
##   kerry  bush  nader other
## 1  43.6 55.02 0.6944 0.681
## 
## $black
##   kerry  bush  nader  other
## 1 87.28 11.63 0.7905 0.2964
## 
## $`hispanic/latino`
##   kerry  bush nader other
## 1 60.75 37.36 0.775 1.117
## 
## $asian
##   kerry  bush  nader  other
## 1 61.43 37.68 0.6869 0.1963
## 
## $other
##   kerry  bush nader other
## 1 59.44 37.46 1.262 1.836

Common functions used with plyr: transform

transform: applies a function to a data frame and returns the altered version of that data frame

olddat <- ddply(.data = cleandat, .variables = .(race), transform, old = ifelse(age65 == 
    "65 or over", 1, 0))

print(olddat[20:40, c("sex", "race", "age65", "old")])
##       sex  race      age65 old
## 20   male white      30-39   0
## 21   male white      40-49   0
## 22 female white      40-49   0
## 23   male white      30-39   0
## 24   male white      40-49   0
## 25   male white 65 or over   1
## 26 female white      30-39   0
## 27   male white      25-29   0
## 28   male white      30-39   0
## 29 female white      40-49   0
## 30   male white      50-64   0
## 31   male white      50-64   0
## 32 female white      40-49   0
## 33 female white      40-49   0
## 34   male white      40-49   0
## 35 female white      50-64   0
## 36   male white      40-49   0
## 37 female white      50-64   0
## 38   male white 65 or over   1
## 39   male white      18-24   0
## 40 female white      30-39   0

Note that transform can't do transformations that involve the results of other transformations from the same call

oldmendat <- ddply(.data = cleandat, .variables = .(race), transform, old = ifelse(age65 == 
    "65 or over", 1, 0), old.man = ifelse(old == 1 & sex == "male", 1, 0))
## Error: object 'old' not found

For this, we need...

Common functions used with plyr: mutate

mutate: just like transform, but it executes the commands iteratively so that transformations can be carried out that rely on previous transformations from the same call

oldmendat <- ddply(.data = cleandat, .variables = .(race), mutate, old = ifelse(age65 == 
    "65 or over", 1, 0), old.man = ifelse(old == 1 & sex == "male", 1, 0))

print(oldmendat[40:80, c("sex", "race", "age65", "old", "old.man")])
##       sex  race      age65 old old.man
## 40 female white      30-39   0       0
## 41 female white      50-64   0       0
## 42   male white      30-39   0       0
## 43 female white      30-39   0       0
## 44 female white      40-49   0       0
## 45   male white      30-39   0       0
## 46 female white      30-39   0       0
## 47   male white      50-64   0       0
## 48 female white      25-29   0       0
## 49 female white      40-49   0       0
## 50 female white      30-39   0       0
## 51   male white      25-29   0       0
## 52   male white      30-39   0       0
## 53 female white      40-49   0       0
## 54   male white      30-39   0       0
## 55   male white      30-39   0       0
## 56   male white      40-49   0       0
## 57   male white      40-49   0       0
## 58   male white      18-24   0       0
## 59 female white      40-49   0       0
## 60 female white 65 or over   1       0
## 61 female white      50-64   0       0
## 62 female white      30-39   0       0
## 63   male white      50-64   0       0
## 64   male white      40-49   0       0
## 65   male white 65 or over   1       1
## 66   male white      30-39   0       0
## 67 female white      50-64   0       0
## 68 female white 65 or over   1       0
## 69   male white      30-39   0       0
## 70 female white      50-64   0       0
## 71 female white      30-39   0       0
## 72   male white      18-24   0       0
## 73 female white      50-64   0       0
## 74 female white      25-29   0       0
## 75   male white      40-49   0       0
## 76   male white      18-24   0       0
## 77   male white      40-49   0       0
## 78 female white      40-49   0       0
## 79 female white 65 or over   1       0
## 80   male white      40-49   0       0

Common functions used with plyr: arrange

arrange: orders a data frame on the basis of column contents

snippet <- oldmendat[1:50, c("sex", "race", "age65", "old", "old.man")]
print(snippet)
##       sex  race      age65 old old.man
## 1  female white      25-29   0       0
## 2    male white      18-24   0       0
## 3  female white      40-49   0       0
## 4  female white      30-39   0       0
## 5    male white      40-49   0       0
## 6  female white      18-24   0       0
## 7  female white      25-29   0       0
## 8  female white      18-24   0       0
## 9  female white      25-29   0       0
## 10 female white      40-49   0       0
## 11 female white      40-49   0       0
## 12   male white      40-49   0       0
## 13   male white      30-39   0       0
## 14   male white      40-49   0       0
## 15   male white      25-29   0       0
## 16 female white      25-29   0       0
## 17 female white      30-39   0       0
## 18   male white      25-29   0       0
## 19 female white      25-29   0       0
## 20   male white      30-39   0       0
## 21   male white      40-49   0       0
## 22 female white      40-49   0       0
## 23   male white      30-39   0       0
## 24   male white      40-49   0       0
## 25   male white 65 or over   1       1
## 26 female white      30-39   0       0
## 27   male white      25-29   0       0
## 28   male white      30-39   0       0
## 29 female white      40-49   0       0
## 30   male white      50-64   0       0
## 31   male white      50-64   0       0
## 32 female white      40-49   0       0
## 33 female white      40-49   0       0
## 34   male white      40-49   0       0
## 35 female white      50-64   0       0
## 36   male white      40-49   0       0
## 37 female white      50-64   0       0
## 38   male white 65 or over   1       1
## 39   male white      18-24   0       0
## 40 female white      30-39   0       0
## 41 female white      50-64   0       0
## 42   male white      30-39   0       0
## 43 female white      30-39   0       0
## 44 female white      40-49   0       0
## 45   male white      30-39   0       0
## 46 female white      30-39   0       0
## 47   male white      50-64   0       0
## 48 female white      25-29   0       0
## 49 female white      40-49   0       0
## 50 female white      30-39   0       0
arrange(df = snippet, sex, desc(age65))
##       sex  race      age65 old old.man
## 1    male white 65 or over   1       1
## 2    male white 65 or over   1       1
## 3    male white      50-64   0       0
## 4    male white      50-64   0       0
## 5    male white      50-64   0       0
## 6    male white      40-49   0       0
## 7    male white      40-49   0       0
## 8    male white      40-49   0       0
## 9    male white      40-49   0       0
## 10   male white      40-49   0       0
## 11   male white      40-49   0       0
## 12   male white      40-49   0       0
## 13   male white      30-39   0       0
## 14   male white      30-39   0       0
## 15   male white      30-39   0       0
## 16   male white      30-39   0       0
## 17   male white      30-39   0       0
## 18   male white      30-39   0       0
## 19   male white      25-29   0       0
## 20   male white      25-29   0       0
## 21   male white      25-29   0       0
## 22   male white      18-24   0       0
## 23   male white      18-24   0       0
## 24 female white      50-64   0       0
## 25 female white      50-64   0       0
## 26 female white      50-64   0       0
## 27 female white      40-49   0       0
## 28 female white      40-49   0       0
## 29 female white      40-49   0       0
## 30 female white      40-49   0       0
## 31 female white      40-49   0       0
## 32 female white      40-49   0       0
## 33 female white      40-49   0       0
## 34 female white      40-49   0       0
## 35 female white      40-49   0       0
## 36 female white      30-39   0       0
## 37 female white      30-39   0       0
## 38 female white      30-39   0       0
## 39 female white      30-39   0       0
## 40 female white      30-39   0       0
## 41 female white      30-39   0       0
## 42 female white      30-39   0       0
## 43 female white      25-29   0       0
## 44 female white      25-29   0       0
## 45 female white      25-29   0       0
## 46 female white      25-29   0       0
## 47 female white      25-29   0       0
## 48 female white      25-29   0       0
## 49 female white      18-24   0       0
## 50 female white      18-24   0       0
arrange(df = snippet, desc(sex), age65)
##       sex  race      age65 old old.man
## 1  female white      18-24   0       0
## 2  female white      18-24   0       0
## 3  female white      25-29   0       0
## 4  female white      25-29   0       0
## 5  female white      25-29   0       0
## 6  female white      25-29   0       0
## 7  female white      25-29   0       0
## 8  female white      25-29   0       0
## 9  female white      30-39   0       0
## 10 female white      30-39   0       0
## 11 female white      30-39   0       0
## 12 female white      30-39   0       0
## 13 female white      30-39   0       0
## 14 female white      30-39   0       0
## 15 female white      30-39   0       0
## 16 female white      40-49   0       0
## 17 female white      40-49   0       0
## 18 female white      40-49   0       0
## 19 female white      40-49   0       0
## 20 female white      40-49   0       0
## 21 female white      40-49   0       0
## 22 female white      40-49   0       0
## 23 female white      40-49   0       0
## 24 female white      40-49   0       0
## 25 female white      50-64   0       0
## 26 female white      50-64   0       0
## 27 female white      50-64   0       0
## 28   male white      18-24   0       0
## 29   male white      18-24   0       0
## 30   male white      25-29   0       0
## 31   male white      25-29   0       0
## 32   male white      25-29   0       0
## 33   male white      30-39   0       0
## 34   male white      30-39   0       0
## 35   male white      30-39   0       0
## 36   male white      30-39   0       0
## 37   male white      30-39   0       0
## 38   male white      30-39   0       0
## 39   male white      40-49   0       0
## 40   male white      40-49   0       0
## 41   male white      40-49   0       0
## 42   male white      40-49   0       0
## 43   male white      40-49   0       0
## 44   male white      40-49   0       0
## 45   male white      40-49   0       0
## 46   male white      50-64   0       0
## 47   male white      50-64   0       0
## 48   male white      50-64   0       0
## 49   male white 65 or over   1       1
## 50   male white 65 or over   1       1

Last note on functions

Any function will work, just remember that when writing function(x), the x that you're getting is one slice of your input object with a single value for your chosen splitting variable. Is the same class as your original object, and all functions you could apply to the original object can be applied its subsets (this may be important in the very near future, hint hint...)

Reshaping Data

Reshaping Data

##     country var_1995 var_1996 var_1997 var_1998 var_1999 var_2000 var_2001
## 1 Australia    79.87    78.58    99.02    80.71    51.56   38.850    46.85
## 2   Austria    17.29    59.17    23.12    73.41    22.06    3.348    42.14
## 3   Belgium    59.12    62.54    23.11    91.34    71.09   10.812    63.65
## 4    Canada    28.37    84.69     6.49    73.54    74.89   85.872    90.31
## 5   Denmark    80.23    21.18    45.45    21.14    98.16   65.903    77.76
## 6    France    39.23    44.18    89.71    50.10    91.92   69.633    50.06

Reshaping Data

reshape2

Using reshape2: melt

Basic idea is to take your data frame and melt it into three essential components

  1. Case identifiers (e.g.: an individual, a country)

  2. Characteristic variables (e.g.: age, race, gender)

  3. Variable values (e.g.: {White, Black, Hispanic/Latino, Asian, Other} for race)

Using reshape2: melt

In the example from before, our only case identifier is country, and our only characteristic variable is var (with observations from 1995-2001)

##     country var_1995 var_1996 var_1997 var_1998 var_1999 var_2000 var_2001
## 1 Australia    79.87    78.58    99.02    80.71    51.56   38.850    46.85
## 2   Austria    17.29    59.17    23.12    73.41    22.06    3.348    42.14
## 3   Belgium    59.12    62.54    23.11    91.34    71.09   10.812    63.65
## 4    Canada    28.37    84.69     6.49    73.54    74.89   85.872    90.31
## 5   Denmark    80.23    21.18    45.45    21.14    98.16   65.903    77.76
## 6    France    39.23    44.18    89.71    50.10    91.92   69.633    50.06

Using reshape2: melt

The generic call for melt looks like this:

melt(data = , id.vars = c("    ", "     ", "    ", ...))

Note that you can also customize the names of the columns that store the variable names and values

We can melt the previous example to a nice time series by the call

wb.melted <- melt(data = wb, id.vars = "country")
print(wb.melted)
##      country variable  value
## 1  Australia var_1995 79.871
## 2    Austria var_1995 17.289
## 3    Belgium var_1995 59.116
## 4     Canada var_1995 28.370
## 5    Denmark var_1995 80.233
## 6     France var_1995 39.233
## 7  Australia var_1996 78.579
## 8    Austria var_1996 59.174
## 9    Belgium var_1996 62.541
## 10    Canada var_1996 84.695
## 11   Denmark var_1996 21.177
## 12    France var_1996 44.178
## 13 Australia var_1997 99.024
## 14   Austria var_1997 23.122
## 15   Belgium var_1997 23.114
## 16    Canada var_1997  6.490
## 17   Denmark var_1997 45.448
## 18    France var_1997 89.707
## 19 Australia var_1998 80.709
## 20   Austria var_1998 73.408
## 21   Belgium var_1998 91.341
## 22    Canada var_1998 73.542
## 23   Denmark var_1998 21.139
## 24    France var_1998 50.097
## 25 Australia var_1999 51.560
## 26   Austria var_1999 22.057
## 27   Belgium var_1999 71.092
## 28    Canada var_1999 74.895
## 29   Denmark var_1999 98.161
## 30    France var_1999 91.917
## 31 Australia var_2000 38.850
## 32   Austria var_2000  3.348
## 33   Belgium var_2000 10.812
## 34    Canada var_2000 85.872
## 35   Denmark var_2000 65.903
## 36    France var_2000 69.633
## 37 Australia var_2001 46.845
## 38   Austria var_2001 42.143
## 39   Belgium var_2001 63.649
## 40    Canada var_2001 90.307
## 41   Denmark var_2001 77.762
## 42    France var_2001 50.064
wb.melted$variable <- as.numeric(gsub(pattern = "var_", replacement = "", x = wb.melted$variable))
print(wb.melted)
##      country variable  value
## 1  Australia     1995 79.871
## 2    Austria     1995 17.289
## 3    Belgium     1995 59.116
## 4     Canada     1995 28.370
## 5    Denmark     1995 80.233
## 6     France     1995 39.233
## 7  Australia     1996 78.579
## 8    Austria     1996 59.174
## 9    Belgium     1996 62.541
## 10    Canada     1996 84.695
## 11   Denmark     1996 21.177
## 12    France     1996 44.178
## 13 Australia     1997 99.024
## 14   Austria     1997 23.122
## 15   Belgium     1997 23.114
## 16    Canada     1997  6.490
## 17   Denmark     1997 45.448
## 18    France     1997 89.707
## 19 Australia     1998 80.709
## 20   Austria     1998 73.408
## 21   Belgium     1998 91.341
## 22    Canada     1998 73.542
## 23   Denmark     1998 21.139
## 24    France     1998 50.097
## 25 Australia     1999 51.560
## 26   Austria     1999 22.057
## 27   Belgium     1999 71.092
## 28    Canada     1999 74.895
## 29   Denmark     1999 98.161
## 30    France     1999 91.917
## 31 Australia     2000 38.850
## 32   Austria     2000  3.348
## 33   Belgium     2000 10.812
## 34    Canada     2000 85.872
## 35   Denmark     2000 65.903
## 36    France     2000 69.633
## 37 Australia     2001 46.845
## 38   Austria     2001 42.143
## 39   Belgium     2001 63.649
## 40    Canada     2001 90.307
## 41   Denmark     2001 77.762
## 42    France     2001 50.064

This comes in really handy when we want to make, say, panel data graphs (though random variabes aren't that great to look at) plot of chunk unnamed-chunk-32

Using reshape2: cast (acast, dcast)

Using reshape2: cast (acast, dcast)

The generic call for (d)cast looks like this:

dcast(data = , formula = x_var1 + x_var2 ~ y_var1 + yvar2, fun.aggregate = )

Note that data frames cannot be produced with more than two dimensions

Using reshape2: cast (acast, dcast)

We can then recast our melted data from the previous example back into its original format

wb.melted$variable <- paste("var_", wb.melted$variable, sep = "")

wb.recast <- dcast(data = wb.melted, formula = country ~ variable)

print(wb.recast)
##     country var_1995 var_1996 var_1997 var_1998 var_1999 var_2000 var_2001
## 1 Australia    79.87    78.58    99.02    80.71    51.56   38.850    46.85
## 2   Austria    17.29    59.17    23.12    73.41    22.06    3.348    42.14
## 3   Belgium    59.12    62.54    23.11    91.34    71.09   10.812    63.65
## 4    Canada    28.37    84.69     6.49    73.54    74.89   85.872    90.31
## 5   Denmark    80.23    21.18    45.45    21.14    98.16   65.903    77.76
## 6    France    39.23    44.18    89.71    50.10    91.92   69.633    50.06

Regression

Running regressions in R is extremely simple, very straightforwd (though doing things with standard errors requires a little extra work)

glm

The basic glm call looks something like this:

glm(formula = y ~ x1 + x2 + x3 + ..., family = familyname(link = "linkname"), 
    data = )

There are a whole bunch of families and links to use (help(family) for a full list), but some essentials are binomial(link = "logit"), gaussian(link = "identity"), and poisson(link = "log")

An Example

Suppose we want to regress being an old man on political party identification, income, and religion using a logit model. The glm call would be something like this:

oldman.reg <- glm(formula = old.man ~ partyid + income + relign8, family = binomial(link = "logit"), 
    data = oldmendat)

When we store this regression in an object, we get access to several items of interest

objects(oldman.reg)
##  [1] "aic"               "boundary"          "call"             
##  [4] "coefficients"      "contrasts"         "control"          
##  [7] "converged"         "data"              "deviance"         
## [10] "df.null"           "df.residual"       "effects"          
## [13] "family"            "fitted.values"     "formula"          
## [16] "iter"              "linear.predictors" "method"           
## [19] "model"             "na.action"         "null.deviance"    
## [22] "offset"            "prior.weights"     "qr"               
## [25] "R"                 "rank"              "residuals"        
## [28] "terms"             "weights"           "xlevels"          
## [31] "y"
oldman.reg$coefficients
##            (Intercept)      partyidrepublican     partyidindependent 
##                -2.0321                 0.0907                 0.1398 
##  partyidsomething else  income$15,000-$29,999  income$30,000-$49,999 
##                -0.7690                -0.1353                -0.5616 
##  income$50,000-$74,999  income$75,000-$99,999        relign8catholic 
##                -0.8756                -0.8373                -0.5462 
##      relign8mormon/lds relign8other christian          relign8jewish 
##                -0.5397                -1.8787                 0.1290 
##          relign8muslim  relign8something else            relign8none 
##                -1.6732                -1.7342                -1.2106
oldman.reg$df.residual
## [1] 62949
oldman.reg$aic
## [1] 23042

summary.glm()

But often times, we just want a nice summary. R has a series of "summary" functions for certain object classes ("glm", in this case) that automatically create nice overviews of the regression results from all the information contained in a fitted object.

sum.oldman.reg <- summary(oldman.reg)

print(sum.oldman.reg)
## 
## Call:
## glm(formula = old.man ~ partyid + income + relign8, family = binomial(link = "logit"), 
##     data = oldmendat)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.563  -0.376  -0.267  -0.193   3.239  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -2.0321     0.0463  -43.86  < 2e-16 ***
## partyidrepublican        0.0907     0.0449    2.02  0.04319 *  
## partyidindependent       0.1398     0.0500    2.80  0.00512 ** 
## partyidsomething else   -0.7690     0.1420   -5.42  6.1e-08 ***
## income$15,000-$29,999   -0.1353     0.0499   -2.71  0.00668 ** 
## income$30,000-$49,999   -0.5616     0.0543  -10.35  < 2e-16 ***
## income$50,000-$74,999   -0.8756     0.0552  -15.85  < 2e-16 ***
## income$75,000-$99,999   -0.8373     0.1207   -6.93  4.1e-12 ***
## relign8catholic         -0.5462     0.0461  -11.85  < 2e-16 ***
## relign8mormon/lds       -0.5397     0.1066   -5.06  4.1e-07 ***
## relign8other christian  -1.8787     0.0745  -25.21  < 2e-16 ***
## relign8jewish            0.1290     0.1071    1.20  0.22856    
## relign8muslim           -1.6732     0.4527   -3.70  0.00022 ***
## relign8something else   -1.7342     0.1249  -13.88  < 2e-16 ***
## relign8none             -1.2106     0.0794  -15.25  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24542  on 62963  degrees of freedom
## Residual deviance: 23012  on 62949  degrees of freedom
##   (11324 observations deleted due to missingness)
## AIC: 23042
## 
## Number of Fisher Scoring iterations: 7

Can also extract useful things from the summary object (like a matrix of coefficient estimates...)

objects(sum.oldman.reg)
##  [1] "aic"            "aliased"        "call"           "coefficients"  
##  [5] "contrasts"      "cov.scaled"     "cov.unscaled"   "deviance"      
##  [9] "deviance.resid" "df"             "df.null"        "df.residual"   
## [13] "dispersion"     "family"         "iter"           "na.action"     
## [17] "null.deviance"  "terms"
coef <- sum.oldman.reg$coefficients
print(coef)
##                        Estimate Std. Error z value   Pr(>|z|)
## (Intercept)             -2.0321    0.04634 -43.855  0.000e+00
## partyidrepublican        0.0907    0.04486   2.022  4.319e-02
## partyidindependent       0.1398    0.04995   2.800  5.118e-03
## partyidsomething else   -0.7690    0.14196  -5.417  6.068e-08
## income$15,000-$29,999   -0.1353    0.04988  -2.712  6.684e-03
## income$30,000-$49,999   -0.5616    0.05428 -10.347  4.326e-25
## income$50,000-$74,999   -0.8756    0.05524 -15.851  1.391e-56
## income$75,000-$99,999   -0.8373    0.12075  -6.935  4.074e-12
## relign8catholic         -0.5462    0.04610 -11.848  2.198e-32
## relign8mormon/lds       -0.5397    0.10661  -5.062  4.146e-07
## relign8other christian  -1.8787    0.07451 -25.215 2.746e-140
## relign8jewish            0.1290    0.10711   1.204  2.286e-01
## relign8muslim           -1.6732    0.45273  -3.696  2.192e-04
## relign8something else   -1.7342    0.12494 -13.881  8.284e-44
## relign8none             -1.2106    0.07940 -15.248  1.693e-52

Factors with glm

Note that, in our results, R has broken up our variables into their different factor levels (as it will do whenever your regressors have factor levels)

If your data aren't factorized, you can tell glm to factorize a variable (i.e. create dummy variables on the fly) by writing

glm(formula = y ~ x1 + x2 + factor(x3), family = family(link = "link"), data = data)

This is really helpful for doing fixed effects (although you may need to use either the package plm, or de-mean the data in advance, if you want two-way fixed effects -- and careful with those standard errors)

Interactions with glm

x1:x2 interacts all terms in x1 with all terms in x2

summary(glm(formula = old.man ~ partyid:relign8, family = binomial(link = "logit"), 
    data = oldmendat))
## 
## Call:
## glm(formula = old.man ~ partyid:relign8, family = binomial(link = "logit"), 
##     data = oldmendat)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.474  -0.416  -0.321  -0.190   3.018  
## 
## Coefficients: (1 not defined because of singularities)
##                                               Estimate Std. Error z value
## (Intercept)                                   -4.38670    0.35575  -12.33
## partyiddemocrat:relign8protestant              2.05179    0.35847    5.72
## partyidrepublican:relign8protestant            1.98342    0.35764    5.55
## partyidindependent:relign8protestant           2.12719    0.35953    5.92
## partyidsomething else:relign8protestant        1.26875    0.41027    3.09
## partyiddemocrat:relign8catholic                1.44507    0.36035    4.01
## partyidrepublican:relign8catholic              1.54119    0.36084    4.27
## partyidindependent:relign8catholic             1.50967    0.36320    4.16
## partyidsomething else:relign8catholic          0.70361    0.45322    1.55
## partyiddemocrat:relign8mormon/lds              1.83103    0.41190    4.45
## partyidrepublican:relign8mormon/lds            1.37272    0.38074    3.61
## partyidindependent:relign8mormon/lds           1.75105    0.42141    4.16
## partyidsomething else:relign8mormon/lds        0.84574    0.80065    1.06
## partyiddemocrat:relign8other christian         0.37790    0.36992    1.02
## partyidrepublican:relign8other christian       0.16285    0.37294    0.44
## partyidindependent:relign8other christian      0.40099    0.38279    1.05
## partyidsomething else:relign8other christian   0.07055    0.47738    0.15
## partyiddemocrat:relign8jewish                  1.92954    0.37784    5.11
## partyidrepublican:relign8jewish                2.25951    0.41859    5.40
## partyidindependent:relign8jewish               2.19307    0.40300    5.44
## partyidsomething else:relign8jewish          -10.17936  130.15390   -0.08
## partyiddemocrat:relign8muslim                  0.05597    0.79570    0.07
## partyidrepublican:relign8muslim                1.77174    0.69590    2.55
## partyidindependent:relign8muslim               0.00468    1.06727    0.00
## partyidsomething else:relign8muslim          -10.17936  202.51553   -0.05
## partyiddemocrat:relign8something else          0.24819    0.39916    0.62
## partyidrepublican:relign8something else        0.86476    0.42194    2.05
## partyidindependent:relign8something else       0.55546    0.40917    1.36
## partyidsomething else:relign8something else   -0.15659    0.61580   -0.25
## partyiddemocrat:relign8none                    0.79135    0.37254    2.12
## partyidrepublican:relign8none                  1.09171    0.38746    2.82
## partyidindependent:relign8none                 0.97190    0.37511    2.59
## partyidsomething else:relign8none                   NA         NA      NA
##                                              Pr(>|z|)    
## (Intercept)                                   < 2e-16 ***
## partyiddemocrat:relign8protestant             1.0e-08 ***
## partyidrepublican:relign8protestant           2.9e-08 ***
## partyidindependent:relign8protestant          3.3e-09 ***
## partyidsomething else:relign8protestant       0.00199 ** 
## partyiddemocrat:relign8catholic               6.1e-05 ***
## partyidrepublican:relign8catholic             1.9e-05 ***
## partyidindependent:relign8catholic            3.2e-05 ***
## partyidsomething else:relign8catholic         0.12055    
## partyiddemocrat:relign8mormon/lds             8.8e-06 ***
## partyidrepublican:relign8mormon/lds           0.00031 ***
## partyidindependent:relign8mormon/lds          3.2e-05 ***
## partyidsomething else:relign8mormon/lds       0.29082    
## partyiddemocrat:relign8other christian        0.30698    
## partyidrepublican:relign8other christian      0.66236    
## partyidindependent:relign8other christian     0.29484    
## partyidsomething else:relign8other christian  0.88251    
## partyiddemocrat:relign8jewish                 3.3e-07 ***
## partyidrepublican:relign8jewish               6.7e-08 ***
## partyidindependent:relign8jewish              5.3e-08 ***
## partyidsomething else:relign8jewish           0.93766    
## partyiddemocrat:relign8muslim                 0.94392    
## partyidrepublican:relign8muslim               0.01090 *  
## partyidindependent:relign8muslim              0.99650    
## partyidsomething else:relign8muslim           0.95991    
## partyiddemocrat:relign8something else         0.53409    
## partyidrepublican:relign8something else       0.04041 *  
## partyidindependent:relign8something else      0.17462    
## partyidsomething else:relign8something else   0.79927    
## partyiddemocrat:relign8none                   0.03365 *  
## partyidrepublican:relign8none                 0.00484 ** 
## partyidindependent:relign8none                0.00957 ** 
## partyidsomething else:relign8none                  NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26783  on 66822  degrees of freedom
## Residual deviance: 25533  on 66791  degrees of freedom
##   (7465 observations deleted due to missingness)
## AIC: 25597
## 
## Number of Fisher Scoring iterations: 13

Interactions with glm

x1*x2 produces the cross of x1 and x2, or x1+x2+x1:x2

summary(glm(formula = old.man ~ partyid * relign8, family = binomial(link = "logit"), 
    data = oldmendat))
## 
## Call:
## glm(formula = old.man ~ partyid * relign8, family = binomial(link = "logit"), 
##     data = oldmendat)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.474  -0.416  -0.321  -0.190   3.018  
## 
## Coefficients:
##                                               Estimate Std. Error z value
## (Intercept)                                   -2.33492    0.04414  -52.90
## partyidrepublican                             -0.06837    0.05744   -1.19
## partyidindependent                             0.07540    0.06823    1.11
## partyidsomething else                         -0.78303    0.20909   -3.74
## relign8catholic                               -0.60671    0.07245   -8.37
## relign8mormon/lds                             -0.22076    0.21226   -1.04
## relign8other christian                        -1.67388    0.11060  -15.13
## relign8jewish                                 -0.12225    0.13473   -0.91
## relign8muslim                                 -1.99582    0.71311   -2.80
## relign8something else                         -1.80360    0.18633   -9.68
## relign8none                                   -1.26044    0.11908  -10.58
## partyidrepublican:relign8catholic              0.16449    0.10123    1.62
## partyidindependent:relign8catholic            -0.01080    0.11540   -0.09
## partyidsomething else:relign8catholic          0.04157    0.35479    0.12
## partyidrepublican:relign8mormon/lds           -0.38994    0.25458   -1.53
## partyidindependent:relign8mormon/lds          -0.15538    0.31431   -0.49
## partyidsomething else:relign8mormon/lds       -0.20225    0.77545   -0.26
## partyidrepublican:relign8other christian      -0.14668    0.16159   -0.91
## partyidindependent:relign8other christian     -0.05231    0.18685   -0.28
## partyidsomething else:relign8other christian   0.47568    0.39413    1.21
## partyidrepublican:relign8jewish                0.39834    0.26108    1.53
## partyidindependent:relign8jewish               0.18813    0.23815    0.79
## partyidsomething else:relign8jewish          -11.32587  130.15364   -0.09
## partyidrepublican:relign8muslim                1.78415    0.93145    1.92
## partyidindependent:relign8muslim              -0.12670    1.23440   -0.10
## partyidsomething else:relign8muslim           -9.45230  202.51658   -0.05
## partyidrepublican:relign8something else        0.68495    0.29589    2.31
## partyidindependent:relign8something else       0.23187    0.27981    0.83
## partyidsomething else:relign8something else    0.37825    0.57372    0.66
## partyidrepublican:relign8none                  0.36873    0.19774    1.86
## partyidindependent:relign8none                 0.10515    0.17617    0.60
## partyidsomething else:relign8none             -0.00832    0.42721   -0.02
##                                              Pr(>|z|)    
## (Intercept)                                   < 2e-16 ***
## partyidrepublican                             0.23393    
## partyidindependent                            0.26909    
## partyidsomething else                         0.00018 ***
## relign8catholic                               < 2e-16 ***
## relign8mormon/lds                             0.29832    
## relign8other christian                        < 2e-16 ***
## relign8jewish                                 0.36423    
## relign8muslim                                 0.00513 ** 
## relign8something else                         < 2e-16 ***
## relign8none                                   < 2e-16 ***
## partyidrepublican:relign8catholic             0.10419    
## partyidindependent:relign8catholic            0.92543    
## partyidsomething else:relign8catholic         0.90673    
## partyidrepublican:relign8mormon/lds           0.12560    
## partyidindependent:relign8mormon/lds          0.62107    
## partyidsomething else:relign8mormon/lds       0.79423    
## partyidrepublican:relign8other christian      0.36400    
## partyidindependent:relign8other christian     0.77950    
## partyidsomething else:relign8other christian  0.22746    
## partyidrepublican:relign8jewish               0.12708    
## partyidindependent:relign8jewish              0.42956    
## partyidsomething else:relign8jewish           0.93066    
## partyidrepublican:relign8muslim               0.05544 .  
## partyidindependent:relign8muslim              0.91825    
## partyidsomething else:relign8muslim           0.96277    
## partyidrepublican:relign8something else       0.02062 *  
## partyidindependent:relign8something else      0.40730    
## partyidsomething else:relign8something else   0.50970    
## partyidrepublican:relign8none                 0.06222 .  
## partyidindependent:relign8none                0.55060    
## partyidsomething else:relign8none             0.98447    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26783  on 66822  degrees of freedom
## Residual deviance: 25533  on 66791  degrees of freedom
##   (7465 observations deleted due to missingness)
## AIC: 25597
## 
## Number of Fisher Scoring iterations: 13

Regression Diagnostics

The package lmtest has most of what you'll need to run basic regression diagnostics.

Breusch-Pagan Test for Heteroscedasticity

bptest(oldman.reg)
## 
##  studentized Breusch-Pagan test
## 
## data:  oldman.reg
## BP = 1497, df = 14, p-value < 2.2e-16

Breusch-Pagan Test for Heteroscedasticity

bptest(oldman.reg)
## 
##  studentized Breusch-Pagan test
## 
## data:  oldman.reg
## BP = 1497, df = 14, p-value < 2.2e-16

Also have tests for autocorrelation of disturbances (Durwin-Watson), higher-order serial correlation (Breusch-Godfrey)

Can also estimate heteroskedasticity/autocorrelation consistent standard errors via coeftest and the sandwich package

coeftest(x = oldman.reg, vcov. = vcovHC)
## 
## z test of coefficients:
## 
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -2.0321     0.0474  -42.86  < 2e-16 ***
## partyidrepublican        0.0907     0.0451    2.01  0.04440 *  
## partyidindependent       0.1398     0.0498    2.81  0.00499 ** 
## partyidsomething else   -0.7690     0.1425   -5.40  6.8e-08 ***
## income$15,000-$29,999   -0.1353     0.0497   -2.72  0.00649 ** 
## income$30,000-$49,999   -0.5616     0.0542  -10.35  < 2e-16 ***
## income$50,000-$74,999   -0.8756     0.0554  -15.81  < 2e-16 ***
## income$75,000-$99,999   -0.8373     0.1204   -6.95  3.6e-12 ***
## relign8catholic         -0.5462     0.0462  -11.81  < 2e-16 ***
## relign8mormon/lds       -0.5397     0.1072   -5.04  4.8e-07 ***
## relign8other christian  -1.8787     0.0744  -25.26  < 2e-16 ***
## relign8jewish            0.1290     0.1073    1.20  0.22938    
## relign8muslim           -1.6732     0.4559   -3.67  0.00024 ***
## relign8something else   -1.7342     0.1258  -13.79  < 2e-16 ***
## relign8none             -1.2106     0.0802  -15.09  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For panel data, the plm package also allows you to compute panel-corrected (Beck-Katz) standard errors

Breakout and overnight homework

Consider the voting preference data. Fit a logistic regression to the data for California (state number 5), modeling preference for Bush (pres04=2) vs. Kerry (pres04=1) (removing the other candidates) as a function of income, potentially including additional covariates such as sex, race and age. What do you find in terms of how income associates with voting preference?

How do you predict the probability of voting for Kerry for a given set of covariate values? Consider the predict.glm() function and what its help page says. Or write code that converts from the model coefficients to the probability scale. Compare the predicted probability of voting for Kerry for a white male of high income with a white male of low income. Do the same for white females.

If you're feeling ambitious...

Using the tools for stratified analyses we have seen today, fit separate models of voting preference as a function of (personal) income and sex (and potentially additional covariates) for each state (or a collection of states of interest to you). How do the effect of income and sex vary by state? The file stateIncome.txt is a tab-delimited file of mean income by state. See if the effects of income and sex are correlated with the state-level mean income.

For our purposes here, don't worry much about the uncertainty in the estimates you get in the logistic regression modeling, but of course in a real analysis you would need to do this.

Note that the translation from the state numbers to names can be obtained by looking at: numToName <- data.frame(name = row.names(state.x77), num = 1:50).

Breakout Answers!