Now you've got all the basic pieces

Note to BB: remember to start recording.

The Answer

R is great at doing useful stuff and making it very, very easy

Useful Stuff (abridged, obviously)

An Example Dataset

Replication data from Gelman et al., “Rich State, Poor State, Red State, Blue State: What’s the Matter with Connecticut?”, Quarterly Journal of Political Science, 2007, 2: 345-367

For decades, the Democrats have been viewed as the party of the poor, with the Republicans representing the rich. Recent presidential elections, however, have shown a reverse pattern, with Democrats performing well in the richer blue states in the northeast and coasts, and Republicans dominating in the red states in the middle of the country and the south. Through multilevel modeling of individual- level survey data and county- and state-level demographic and electoral data, we reconcile these patterns. Furthermore, we find that income matters more in red America than in blue America. In poor states, rich people are much more likely than poor people to vote for the Republican presidential candidate, but in rich states (such as Connecticut), income has a very low correlation with vote preference.

First Steps

Let's see what we're dealing with here.

str(mydata)
## 'data.frame':    76205 obs. of  17 variables:
##  $ state   : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ pres04  : int  1 2 1 1 1 1 1 2 2 2 ...
##  $ sex     : Factor w/ 2 levels "male","female": 2 1 2 2 2 2 1 2 2 2 ...
##  $ race    : Factor w/ 5 levels "white","black",..: 1 1 2 2 1 1 1 1 1 1 ...
##  $ age9    : Factor w/ 9 levels "18-24","25-29",..: 2 1 3 3 4 3 4 1 2 1 ...
##  $ partyid : Factor w/ 4 levels "democrat","republican",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ income  : Factor w/ 8 levels "under $15,000",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ relign8 : Factor w/ 8 levels "protestant","catholic",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ age60   : Factor w/ 4 levels "18-29","30-44",..: 1 1 2 2 2 2 2 1 1 1 ...
##  $ age65   : Factor w/ 6 levels "18-24","25-29",..: 2 1 3 3 4 3 4 1 2 1 ...
##  $ geocode : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ sizeplac: Factor w/ 5 levels "city over 500,000",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ brnagain: Factor w/ 2 levels "yes","no": NA NA NA NA NA NA NA NA NA NA ...
##  $ attend  : Factor w/ 6 levels "more than once a week",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ year    : num  2004 2004 2004 2004 2004 ...
##  $ region  : num  4 4 4 4 4 4 4 4 4 4 ...
##  $ y       : num  0 1 0 0 0 0 0 1 1 1 ...
##  - attr(*, "datalabel")= chr ""
##  - attr(*, "time.stamp")= chr " 6 Jun 2007 14:53"
##  - attr(*, "formats")= chr  "%8.0g" "%8.0g" "%8.0g" "%8.0g" ...
##  - attr(*, "types")= int  251 251 251 251 251 251 251 251 251 251 ...
##  - attr(*, "val.labels")= chr  "stanum" "presak04" "sex" "race" ...
##  - attr(*, "var.labels")= chr  "state id" "in today's election for president, did you just vote for:" "are you:" "are you:" ...
##  - attr(*, "version")= int 8
##  - attr(*, "label.table")=List of 14
##   ..$ stanum  : Named num 2
##   .. ..- attr(*, "names")= chr "alaska"
##   ..$ presak04: Named num  0 1 2 3 9
##   .. ..- attr(*, "names")= chr  "did not vote" "kerry" "bush" "nader" ...
##   ..$ sex     : Named num  1 2
##   .. ..- attr(*, "names")= chr  "male" "female"
##   ..$ race    : Named num  1 2 3 4 5
##   .. ..- attr(*, "names")= chr  "white" "black" "hispanic/latino" "asian" ...
##   ..$ age9    : Named num  1 2 3 4 5 6 7 8 9
##   .. ..- attr(*, "names")= chr  "18-24" "25-29" "30-39" "40-44" ...
##   ..$ partyid : Named num  1 2 3 4
##   .. ..- attr(*, "names")= chr  "democrat" "republican" "independent" "something else"
##   ..$ income  : Named num  1 2 3 4 5 6 7 8
##   .. ..- attr(*, "names")= chr  "under $15,000" "$15,000-$29,999" "$30,000-$49,999" "$50,000-$74,999" ...
##   ..$ relign8 : Named num  1 2 3 4 5 6 7 8
##   .. ..- attr(*, "names")= chr  "protestant" "catholic" "mormon/lds" "other christian" ...
##   ..$ age60   : Named num  1 2 3 4
##   .. ..- attr(*, "names")= chr  "18-29" "30-44" "45-59" "60 or over"
##   ..$ age65   : Named num  1 2 3 4 5 6
##   .. ..- attr(*, "names")= chr  "18-24" "25-29" "30-39" "40-49" ...
##   ..$ geocode : Named num  1 2 3
##   .. ..- attr(*, "names")= chr  "juneau/fairbanks/rural" "anchorage" "anchorage-fairbanks corridor"
##   ..$ sizeplac: Named num  1 2 3 4 5
##   .. ..- attr(*, "names")= chr  "city over 500,000" "city: 50,000 to 500,000" "suburbs" "city: 10,000 to 49,999" ...
##   ..$ brnagain: Named num  1 2
##   .. ..- attr(*, "names")= chr  "yes" "no"
##   ..$ attend  : Named num  1 2 3 4 5 9
##   .. ..- attr(*, "names")= chr  "more than once a week" "once a week" "a few times a month" "a few times a year" ...

This is a bit messy, but it's a decently comprehensive overview.

First Steps

First Steps

Take a look at the variable “age9”

unique(mydata$age9)
##  [1] 25-29      18-24      30-39      40-44      45-49      65-74     
##  [7] 75 or over 50-59      <NA>       60-64     
## Levels: 18-24 25-29 30-39 40-44 45-49 50-59 60-64 65-74 75 or over
Binned Age True Age
1 18-24
2 25-29
3 30-39
4 40-44
5 45-49
6 50-59
7 60-64
8 65-74
9 75+

Let's look at the mean, standard deviation, median, and quintiles for “age9”

mean(x = as.numeric(mydata$age9), na.rm = T)
## [1] 4.544
sd(x = as.numeric(mydata$age9), na.rm = T)
## [1] 2.238
median(x = as.numeric(mydata$age9), na.rm = T)
## [1] 4
quantile(x = as.numeric(mydata$age9), na.rm = T, probs = seq(from = 0, to = 1, 
    by = 0.2))
##   0%  20%  40%  60%  80% 100% 
##    1    3    4    5    6    9

First Steps

We could do the same thing for “income”

unique(mydata$income)
## [1] <NA>            $30,000-$49,999 $15,000-$29,999 $50,000-$74,999
## [5] under $15,000   $75,000-$99,999
## 8 Levels: under $15,000 $15,000-$29,999 ... $200,000 or more
Binned Income True Income
1 $0-$14,999
2 $15,000-$29,999
3 $30,000-$49,999
4 $50,000-$74,999
5 $75,000-$99,999
6 $100,000-$149,999
7 $150,000-$199,999
8 $200,000- ∞
mean(x = as.numeric(mydata$income), na.rm = T)
## [1] 2.637
sd(x = as.numeric(mydata$income), na.rm = T)
## [1] 1.206
median(x = as.numeric(mydata$income), na.rm = T)
## [1] 3
quantile(x = as.numeric(mydata$income), na.rm = T, probs = seq(from = 0, to = 1, 
    by = 0.2))
##   0%  20%  40%  60%  80% 100% 
##    1    1    2    3    4    5

First Steps

For example…

Gender balance?

gender <- mydata$sex
gender <- gender[!is.na(gender)]
unique(gender)
## [1] female male  
## Levels: male female
males <- length(gender[gender == "male"])/length(gender) * 100
cat(males, "%", sep = "")
## 45.13%
females <- length(gender[gender == "female"])/length(gender) * 100
cat(females, "%", sep = "")
## 54.87%

For example…

Political balance?

partyid <- mydata$partyid
partyid <- partyid[!is.na(partyid)]
unique(partyid)
## [1] republican     democrat       independent    something else
## Levels: democrat republican independent something else
dem <- length(partyid[partyid == "democrat"])/length(partyid) * 100
cat(dem, "%", sep = "")
## 37.11%
rep <- length(partyid[partyid == "republican"])/length(partyid) * 100
cat(rep, "%", sep = "")
## 35.57%
ind <- length(partyid[partyid == "independent"])/length(partyid) * 100
cat(ind, "%", sep = "")
## 22.64%
other <- length(partyid[partyid == "something else"])/length(partyid) * 100
cat(other, "%", sep = "")
## 4.675%

For example…

Racial balance?

race <- mydata$race
race <- race[!is.na(race)]
unique(race)
## [1] white           black           other           asian          
## [5] hispanic/latino
## Levels: white black hispanic/latino asian other
white <- length(race[race == "white"])/length(race) * 100
cat(white, "%", sep = "")
## 80.79%
black <- length(race[race == "black"])/length(race) * 100
cat(black, "%", sep = "")
## 9.558%
hispanic <- length(race[race == "hispanic/latino"])/length(race) * 100
cat(hispanic, "%", sep = "")
## 5.909%
asian <- length(race[race == "asian"])/length(race) * 100
cat(asian, "%", sep = "")
## 1.371%
other <- length(race[race == "other"])/length(race) * 100
cat(other, "%", sep = "")
## 2.376%

This can get really complicated really quickly

How to tackle these tabulations?

All techniques for this problem rely on the split-apply-combine strategy

First, take the data (or some object) and split it into smaller datasets on the basis of some variable

Dataset A

x y z
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2

Datasets B and C (Dataset A split according to “z”)

x y z x y z
1 1 1 4 1 2
2 2 1 5 2 2
3 3 1 6 3 2

How to tackle these tabulations?

Second, apply some function to each one of the smaller datasets/objects

Example function: mean of variables “x” and “y”

Datasets B' and C'

mean(x) mean(y) z mean(x) mean(y) z
2 2 1 5 2 2

How to tackle these tabulations?

Third, combine the results into a larger dataset/object

Datasets B' and C'

mean(x) mean(y) z mean(x) mean(y) z
2 2 1 5 2 2

Dataset A'

mean(x) mean(y) z
2 2 1
5 2 2

Tabulating the hard way

Split

white.sub <- mydata[which(mydata$race == "white" & is.na(mydata$race) == F), 
    ]

black.sub <- mydata[which(mydata$race == "black" & is.na(mydata$race) == F), 
    ]

hispanic.sub <- mydata[which(mydata$race == "hispanic/latino" & is.na(mydata$race) == 
    F), ]

Apply

white.pres04 <- white.sub$pres04
white.pres04 <- white.pres04[is.element(white.pres04, c(1:3, 9)) == T]

white.k <- length(white.pres04[white.pres04 == 1])/length(white.pres04) * 100
white.b <- length(white.pres04[white.pres04 == 2])/length(white.pres04) * 100
white.n <- length(white.pres04[white.pres04 == 3])/length(white.pres04) * 100
white.o <- length(white.pres04[white.pres04 == 9])/length(white.pres04) * 100

black.pres04 <- black.sub$pres04
black.pres04 <- black.pres04[is.element(black.pres04, c(1:3, 9)) == T]

black.k <- length(black.pres04[black.pres04 == 1])/length(black.pres04) * 100
black.b <- length(black.pres04[black.pres04 == 2])/length(black.pres04) * 100
black.n <- length(black.pres04[black.pres04 == 3])/length(black.pres04) * 100
black.o <- length(black.pres04[black.pres04 == 9])/length(black.pres04) * 100

. . .

Combine

percent <- as.vector(c(white.k, white.b, white.n, white.o, black.k, black.b, 
    black.n, black.o))
vote <- rep(c("kerry", "bush", "nader", "other"), 2)
race <- c(rep("white", 4), rep("black", 4))
newdata <- data.frame(race, vote, percent)
print(newdata)
##    race  vote percent
## 1 white kerry 43.6000
## 2 white  bush 55.0246
## 3 white nader  0.6944
## 4 white other  0.6810
## 5 black kerry 87.2812
## 6 black  bush 11.6318
## 7 black nader  0.7905
## 8 black other  0.2964

. . .

Tabulating the less-hard way

Split

races = split(mydata$pres04, mydata$race)

Apply/Combine

results <- matrix(NA, nrow = length(races), ncol = length(c(1:3, 9)))
rownames(results) <- objects(races)
colnames(results) <- c("kerry", "bush", "nader", "other")
print(results)
##                 kerry bush nader other
## asian              NA   NA    NA    NA
## black              NA   NA    NA    NA
## hispanic/latino    NA   NA    NA    NA
## other              NA   NA    NA    NA
## white              NA   NA    NA    NA
for (i in 1:length(races)) {
    race.subset <- races[[i]]
    race.subset <- race.subset[is.element(race.subset, c(1:3, 9)) == T]
    results[i, "kerry"] <- length(race.subset[race.subset == 1])/length(race.subset) * 
        100
    results[i, "bush"] <- length(race.subset[race.subset == 2])/length(race.subset) * 
        100
    results[i, "nader"] <- length(race.subset[race.subset == 3])/length(race.subset) * 
        100
    results[i, "other"] <- length(race.subset[race.subset == 9])/length(race.subset) * 
        100
}
print(results)
##                 kerry  bush  nader  other
## asian           43.60 55.02 0.6944 0.6810
## black           87.28 11.63 0.7905 0.2964
## hispanic/latino 60.75 37.36 0.7750 1.1169
## other           61.43 37.68 0.6869 0.1963
## white           59.44 37.46 1.2622 1.8359

Tabulating the easy way

cleandat <- mydata[which(is.element(mydata$pres04, c(1:3, 9)) == T & is.na(mydata$race) == 
    F), ]

vote.by.race <- ddply(.data = cleandat, .variables = .(race), summarize, kerry = length(pres04[pres04 == 
    1])/length(pres04) * 100, bush = length(pres04[pres04 == 2])/length(pres04) * 
    100, nader = length(pres04[pres04 == 3])/length(pres04) * 100, other = length(pres04[pres04 == 
    9])/length(pres04) * 100)

print(vote.by.race)
##              race kerry  bush  nader  other
## 1           white 43.60 55.02 0.6944 0.6810
## 2           black 87.28 11.63 0.7905 0.2964
## 3 hispanic/latino 60.75 37.36 0.7750 1.1169
## 4           asian 61.43 37.68 0.6869 0.1963
## 5           other 59.44 37.46 1.2622 1.8359

plyr

plyr (Hadley Wickham, Rice) is the go-to package for all your splitting-applying-combining needs

Among its many benefits (above base R capabilities):

Using plyr: how to select functions

Two questions:

1) What is the class of your input object?

2) What is the class of your desired output object?

Using plyr: how to write commands

All of the major plyr functions have the same basic syntax

xxply(.data = , .variables = .(, , , ...), .fun = )

Consider the previous example using ddply:

vote.by.race <- ddply(.data = cleandat, .variables = .(race), summarize, kerry = length(pres04[pres04 == 
    1])/length(pres04) * 100, bush = length(pres04[pres04 == 2])/length(pres04) * 
    100, nader = length(pres04[pres04 == 3])/length(pres04) * 100, other = length(pres04[pres04 == 
    9])/length(pres04) * 100)

Using plyr: dlply

vote.by.race.dl <- dlply(.data = cleandat, .variables = .(race), summarize, 
    kerry = length(pres04[pres04 == 1])/length(pres04) * 100, bush = length(pres04[pres04 == 
        2])/length(pres04) * 100, nader = length(pres04[pres04 == 3])/length(pres04) * 
        100, other = length(pres04[pres04 == 9])/length(pres04) * 100)

print(vote.by.race.dl)
## $white
##   kerry  bush  nader other
## 1  43.6 55.02 0.6944 0.681
## 
## $black
##   kerry  bush  nader  other
## 1 87.28 11.63 0.7905 0.2964
## 
## $`hispanic/latino`
##   kerry  bush nader other
## 1 60.75 37.36 0.775 1.117
## 
## $asian
##   kerry  bush  nader  other
## 1 61.43 37.68 0.6869 0.1963
## 
## $other
##   kerry  bush nader other
## 1 59.44 37.46 1.262 1.836
## 
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
##              race
## 1           white
## 2           black
## 3 hispanic/latino
## 4           asian
## 5           other

Using plyr: ldply

races = split(cleandat, cleandat$race)

objects(races)
## [1] "asian"           "black"           "hispanic/latino" "other"          
## [5] "white"
vote.by.race.ld <- ldply(.data = races, summarize, kerry = length(pres04[pres04 == 
    1])/length(pres04) * 100, bush = length(pres04[pres04 == 2])/length(pres04) * 
    100, nader = length(pres04[pres04 == 3])/length(pres04) * 100, other = length(pres04[pres04 == 
    9])/length(pres04) * 100)

print(vote.by.race.ld)
##               .id kerry  bush  nader  other
## 1           white 43.60 55.02 0.6944 0.6810
## 2           black 87.28 11.63 0.7905 0.2964
## 3 hispanic/latino 60.75 37.36 0.7750 1.1169
## 4           asian 61.43 37.68 0.6869 0.1963
## 5           other 59.44 37.46 1.2622 1.8359

Using plyr: llply

vote.by.race.ll <- llply(.data = races, summarize, kerry = length(pres04[pres04 == 
    1])/length(pres04) * 100, bush = length(pres04[pres04 == 2])/length(pres04) * 
    100, nader = length(pres04[pres04 == 3])/length(pres04) * 100, other = length(pres04[pres04 == 
    9])/length(pres04) * 100)

print(vote.by.race.ll)
## $white
##   kerry  bush  nader other
## 1  43.6 55.02 0.6944 0.681
## 
## $black
##   kerry  bush  nader  other
## 1 87.28 11.63 0.7905 0.2964
## 
## $`hispanic/latino`
##   kerry  bush nader other
## 1 60.75 37.36 0.775 1.117
## 
## $asian
##   kerry  bush  nader  other
## 1 61.43 37.68 0.6869 0.1963
## 
## $other
##   kerry  bush nader other
## 1 59.44 37.46 1.262 1.836

Common functions used with plyr: transform

transform: applies a function to a data frame and returns the altered version of that data frame

olddat <- ddply(.data = cleandat, .variables = .(race), transform, old = ifelse(age65 == 
    "65 or over", 1, 0))

print(olddat[20:40, c("sex", "race", "age65", "old")])
##       sex  race      age65 old
## 20   male white      30-39   0
## 21   male white      40-49   0
## 22 female white      40-49   0
## 23   male white      30-39   0
## 24   male white      40-49   0
## 25   male white 65 or over   1
## 26 female white      30-39   0
## 27   male white      25-29   0
## 28   male white      30-39   0
## 29 female white      40-49   0
## 30   male white      50-64   0
## 31   male white      50-64   0
## 32 female white      40-49   0
## 33 female white      40-49   0
## 34   male white      40-49   0
## 35 female white      50-64   0
## 36   male white      40-49   0
## 37 female white      50-64   0
## 38   male white 65 or over   1
## 39   male white      18-24   0
## 40 female white      30-39   0

Note that transform can't do transformations that involve the results of other transformations from the same call

oldmendat <- ddply(.data = cleandat, .variables = .(race), transform, old = ifelse(age65 == 
    "65 or over", 1, 0), old.man = ifelse(old == 1 & sex == "male", 1, 0))
## Error: object 'old' not found

For this, we need…

Common functions used with plyr: mutate

mutate: just like transform, but it executes the commands iteratively so that transformations can be carried out that rely on previous transformations from the same call

oldmendat <- ddply(.data = cleandat, .variables = .(race), mutate, old = ifelse(age65 == 
    "65 or over", 1, 0), old.man = ifelse(old == 1 & sex == "male", 1, 0))

print(oldmendat[40:80, c("sex", "race", "age65", "old", "old.man")])
##       sex  race      age65 old old.man
## 40 female white      30-39   0       0
## 41 female white      50-64   0       0
## 42   male white      30-39   0       0
## 43 female white      30-39   0       0
## 44 female white      40-49   0       0
## 45   male white      30-39   0       0
## 46 female white      30-39   0       0
## 47   male white      50-64   0       0
## 48 female white      25-29   0       0
## 49 female white      40-49   0       0
## 50 female white      30-39   0       0
## 51   male white      25-29   0       0
## 52   male white      30-39   0       0
## 53 female white      40-49   0       0
## 54   male white      30-39   0       0
## 55   male white      30-39   0       0
## 56   male white      40-49   0       0
## 57   male white      40-49   0       0
## 58   male white      18-24   0       0
## 59 female white      40-49   0       0
## 60 female white 65 or over   1       0
## 61 female white      50-64   0       0
## 62 female white      30-39   0       0
## 63   male white      50-64   0       0
## 64   male white      40-49   0       0
## 65   male white 65 or over   1       1
## 66   male white      30-39   0       0
## 67 female white      50-64   0       0
## 68 female white 65 or over   1       0
## 69   male white      30-39   0       0
## 70 female white      50-64   0       0
## 71 female white      30-39   0       0
## 72   male white      18-24   0       0
## 73 female white      50-64   0       0
## 74 female white      25-29   0       0
## 75   male white      40-49   0       0
## 76   male white      18-24   0       0
## 77   male white      40-49   0       0
## 78 female white      40-49   0       0
## 79 female white 65 or over   1       0
## 80   male white      40-49   0       0

Common functions used with plyr: arrange

arrange: orders a data frame on the basis of column contents

snippet <- oldmendat[1:50, c("sex", "race", "age65", "old", "old.man")]
print(snippet)
##       sex  race      age65 old old.man
## 1  female white      25-29   0       0
## 2    male white      18-24   0       0
## 3  female white      40-49   0       0
## 4  female white      30-39   0       0
## 5    male white      40-49   0       0
## 6  female white      18-24   0       0
## 7  female white      25-29   0       0
## 8  female white      18-24   0       0
## 9  female white      25-29   0       0
## 10 female white      40-49   0       0
## 11 female white      40-49   0       0
## 12   male white      40-49   0       0
## 13   male white      30-39   0       0
## 14   male white      40-49   0       0
## 15   male white      25-29   0       0
## 16 female white      25-29   0       0
## 17 female white      30-39   0       0
## 18   male white      25-29   0       0
## 19 female white      25-29   0       0
## 20   male white      30-39   0       0
## 21   male white      40-49   0       0
## 22 female white      40-49   0       0
## 23   male white      30-39   0       0
## 24   male white      40-49   0       0
## 25   male white 65 or over   1       1
## 26 female white      30-39   0       0
## 27   male white      25-29   0       0
## 28   male white      30-39   0       0
## 29 female white      40-49   0       0
## 30   male white      50-64   0       0
## 31   male white      50-64   0       0
## 32 female white      40-49   0       0
## 33 female white      40-49   0       0
## 34   male white      40-49   0       0
## 35 female white      50-64   0       0
## 36   male white      40-49   0       0
## 37 female white      50-64   0       0
## 38   male white 65 or over   1       1
## 39   male white      18-24   0       0
## 40 female white      30-39   0       0
## 41 female white      50-64   0       0
## 42   male white      30-39   0       0
## 43 female white      30-39   0       0
## 44 female white      40-49   0       0
## 45   male white      30-39   0       0
## 46 female white      30-39   0       0
## 47   male white      50-64   0       0
## 48 female white      25-29   0       0
## 49 female white      40-49   0       0
## 50 female white      30-39   0       0
arrange(df = snippet, sex, desc(age65))
##       sex  race      age65 old old.man
## 1    male white 65 or over   1       1
## 2    male white 65 or over   1       1
## 3    male white      50-64   0       0
## 4    male white      50-64   0       0
## 5    male white      50-64   0       0
## 6    male white      40-49   0       0
## 7    male white      40-49   0       0
## 8    male white      40-49   0       0
## 9    male white      40-49   0       0
## 10   male white      40-49   0       0
## 11   male white      40-49   0       0
## 12   male white      40-49   0       0
## 13   male white      30-39   0       0
## 14   male white      30-39   0       0
## 15   male white      30-39   0       0
## 16   male white      30-39   0       0
## 17   male white      30-39   0       0
## 18   male white      30-39   0       0
## 19   male white      25-29   0       0
## 20   male white      25-29   0       0
## 21   male white      25-29   0       0
## 22   male white      18-24   0       0
## 23   male white      18-24   0       0
## 24 female white      50-64   0       0
## 25 female white      50-64   0       0
## 26 female white      50-64   0       0
## 27 female white      40-49   0       0
## 28 female white      40-49   0       0
## 29 female white      40-49   0       0
## 30 female white      40-49   0       0
## 31 female white      40-49   0       0
## 32 female white      40-49   0       0
## 33 female white      40-49   0       0
## 34 female white      40-49   0       0
## 35 female white      40-49   0       0
## 36 female white      30-39   0       0
## 37 female white      30-39   0       0
## 38 female white      30-39   0       0
## 39 female white      30-39   0       0
## 40 female white      30-39   0       0
## 41 female white      30-39   0       0
## 42 female white      30-39   0       0
## 43 female white      25-29   0       0
## 44 female white      25-29   0       0
## 45 female white      25-29   0       0
## 46 female white      25-29   0       0
## 47 female white      25-29   0       0
## 48 female white      25-29   0       0
## 49 female white      18-24   0       0
## 50 female white      18-24   0       0
arrange(df = snippet, desc(sex), age65)
##       sex  race      age65 old old.man
## 1  female white      18-24   0       0
## 2  female white      18-24   0       0
## 3  female white      25-29   0       0
## 4  female white      25-29   0       0
## 5  female white      25-29   0       0
## 6  female white      25-29   0       0
## 7  female white      25-29   0       0
## 8  female white      25-29   0       0
## 9  female white      30-39   0       0
## 10 female white      30-39   0       0
## 11 female white      30-39   0       0
## 12 female white      30-39   0       0
## 13 female white      30-39   0       0
## 14 female white      30-39   0       0
## 15 female white      30-39   0       0
## 16 female white      40-49   0       0
## 17 female white      40-49   0       0
## 18 female white      40-49   0       0
## 19 female white      40-49   0       0
## 20 female white      40-49   0       0
## 21 female white      40-49   0       0
## 22 female white      40-49   0       0
## 23 female white      40-49   0       0
## 24 female white      40-49   0       0
## 25 female white      50-64   0       0
## 26 female white      50-64   0       0
## 27 female white      50-64   0       0
## 28   male white      18-24   0       0
## 29   male white      18-24   0       0
## 30   male white      25-29   0       0
## 31   male white      25-29   0       0
## 32   male white      25-29   0       0
## 33   male white      30-39   0       0
## 34   male white      30-39   0       0
## 35   male white      30-39   0       0
## 36   male white      30-39   0       0
## 37   male white      30-39   0       0
## 38   male white      30-39   0       0
## 39   male white      40-49   0       0
## 40   male white      40-49   0       0
## 41   male white      40-49   0       0
## 42   male white      40-49   0       0
## 43   male white      40-49   0       0
## 44   male white      40-49   0       0
## 45   male white      40-49   0       0
## 46   male white      50-64   0       0
## 47   male white      50-64   0       0
## 48   male white      50-64   0       0
## 49   male white 65 or over   1       1
## 50   male white 65 or over   1       1

Last note on functions

Any function will work, just remember that when writing function(x), the x that you're getting is one slice of your input object with a single value for your chosen splitting variable. Is the same class as your original object, and all functions you could apply to the original object can be applied its subsets (this may be important in the very near future, hint hint…)

Reshaping Data

Reshaping Data

##     country var_1995 var_1996 var_1997 var_1998 var_1999 var_2000 var_2001
## 1 Australia    24.64   42.531    48.70    64.02   42.692   13.173    69.78
## 2   Austria    84.26    2.306    70.61    33.07   32.383   16.323    49.40
## 3   Belgium    50.53   40.089    77.79    55.47   34.851    4.911    25.04
## 4    Canada    82.52   36.615    13.61    11.85   47.493   58.459    65.42
## 5   Denmark    44.48   29.478    98.91    82.43    8.779    7.740    84.56
## 6    France    76.54   37.595    14.06    77.21   36.486   66.768    31.45

Reshaping Data

reshape2

Using reshape2: melt

Basic idea is to take your data frame and melt it into three essential components

1) Case identifiers (e.g.: an individual, a country)

2) Characteristic variables (e.g.: age, race, gender)

3) Variable values (e.g.: {White, Black, Hispanic/Latino, Asian, Other} for race)

Using reshape2: melt

In the example from before, our only case identifier is country, and our only characteristic variable is var (with observations from 1995-2001)

##     country var_1995 var_1996 var_1997 var_1998 var_1999 var_2000 var_2001
## 1 Australia    24.64   42.531    48.70    64.02   42.692   13.173    69.78
## 2   Austria    84.26    2.306    70.61    33.07   32.383   16.323    49.40
## 3   Belgium    50.53   40.089    77.79    55.47   34.851    4.911    25.04
## 4    Canada    82.52   36.615    13.61    11.85   47.493   58.459    65.42
## 5   Denmark    44.48   29.478    98.91    82.43    8.779    7.740    84.56
## 6    France    76.54   37.595    14.06    77.21   36.486   66.768    31.45

Using reshape2: melt

The generic call for melt looks like this:

melt(data = , id.vars = c("    ", "     ", "    ", ...))

Note that you can also customize the names of the columns that store the variable names and values

We can melt the previous example to a nice time series by the call

wb.melted <- melt(data = wb, id.vars = "country")
print(wb.melted)
##      country variable  value
## 1  Australia var_1995 24.643
## 2    Austria var_1995 84.264
## 3    Belgium var_1995 50.526
## 4     Canada var_1995 82.521
## 5    Denmark var_1995 44.480
## 6     France var_1995 76.538
## 7  Australia var_1996 42.531
## 8    Austria var_1996  2.306
## 9    Belgium var_1996 40.089
## 10    Canada var_1996 36.615
## 11   Denmark var_1996 29.478
## 12    France var_1996 37.595
## 13 Australia var_1997 48.697
## 14   Austria var_1997 70.609
## 15   Belgium var_1997 77.792
## 16    Canada var_1997 13.610
## 17   Denmark var_1997 98.912
## 18    France var_1997 14.059
## 19 Australia var_1998 64.025
## 20   Austria var_1998 33.068
## 21   Belgium var_1998 55.468
## 22    Canada var_1998 11.850
## 23   Denmark var_1998 82.434
## 24    France var_1998 77.212
## 25 Australia var_1999 42.692
## 26   Austria var_1999 32.383
## 27   Belgium var_1999 34.851
## 28    Canada var_1999 47.493
## 29   Denmark var_1999  8.779
## 30    France var_1999 36.486
## 31 Australia var_2000 13.173
## 32   Austria var_2000 16.323
## 33   Belgium var_2000  4.911
## 34    Canada var_2000 58.459
## 35   Denmark var_2000  7.740
## 36    France var_2000 66.768
## 37 Australia var_2001 69.784
## 38   Austria var_2001 49.400
## 39   Belgium var_2001 25.042
## 40    Canada var_2001 65.417
## 41   Denmark var_2001 84.562
## 42    France var_2001 31.451
wb.melted$variable <- as.numeric(gsub(pattern = "var_", replacement = "", x = wb.melted$variable))
print(wb.melted)
##      country variable  value
## 1  Australia     1995 24.643
## 2    Austria     1995 84.264
## 3    Belgium     1995 50.526
## 4     Canada     1995 82.521
## 5    Denmark     1995 44.480
## 6     France     1995 76.538
## 7  Australia     1996 42.531
## 8    Austria     1996  2.306
## 9    Belgium     1996 40.089
## 10    Canada     1996 36.615
## 11   Denmark     1996 29.478
## 12    France     1996 37.595
## 13 Australia     1997 48.697
## 14   Austria     1997 70.609
## 15   Belgium     1997 77.792
## 16    Canada     1997 13.610
## 17   Denmark     1997 98.912
## 18    France     1997 14.059
## 19 Australia     1998 64.025
## 20   Austria     1998 33.068
## 21   Belgium     1998 55.468
## 22    Canada     1998 11.850
## 23   Denmark     1998 82.434
## 24    France     1998 77.212
## 25 Australia     1999 42.692
## 26   Austria     1999 32.383
## 27   Belgium     1999 34.851
## 28    Canada     1999 47.493
## 29   Denmark     1999  8.779
## 30    France     1999 36.486
## 31 Australia     2000 13.173
## 32   Austria     2000 16.323
## 33   Belgium     2000  4.911
## 34    Canada     2000 58.459
## 35   Denmark     2000  7.740
## 36    France     2000 66.768
## 37 Australia     2001 69.784
## 38   Austria     2001 49.400
## 39   Belgium     2001 25.042
## 40    Canada     2001 65.417
## 41   Denmark     2001 84.562
## 42    France     2001 31.451

This comes in really handy when we want to make, say, panel data graphs (though random variabes aren't that great to look at) plot of chunk unnamed-chunk-32

Using reshape2: cast (acast, dcast)

Using reshape2: cast (acast, dcast)

The generic call for (d)cast looks like this:

dcast(data = , formula = x_var1 + x_var2 ~ y_var1 + yvar2, fun.aggregate = )

Note that data frames cannot be produced with more than two dimensions

Using reshape2: cast (acast, dcast)

We can then recast our melted data from the previous example back into its original format

wb.melted$variable <- paste("var_", wb.melted$variable, sep = "")

wb.recast <- dcast(data = wb.melted, formula = country ~ variable)

print(wb.recast)
##     country var_1995 var_1996 var_1997 var_1998 var_1999 var_2000 var_2001
## 1 Australia    24.64   42.531    48.70    64.02   42.692   13.173    69.78
## 2   Austria    84.26    2.306    70.61    33.07   32.383   16.323    49.40
## 3   Belgium    50.53   40.089    77.79    55.47   34.851    4.911    25.04
## 4    Canada    82.52   36.615    13.61    11.85   47.493   58.459    65.42
## 5   Denmark    44.48   29.478    98.91    82.43    8.779    7.740    84.56
## 6    France    76.54   37.595    14.06    77.21   36.486   66.768    31.45

Regression

Running regressions in R is extremely simple, very straightforwd (though doing things with standard errors requires a little extra work)

glm

The basic glm call looks something like this:

glm(formula = y ~ x1 + x2 + x3 + ..., family = familyname(link = "linkname"), 
    data = )

There are a whole bunch of families and links to use (help(family) for a full list), but some essentials are binomial(link = “logit”), gaussian(link = “identity”), and poisson(link = “log”)

An Example

Suppose we want to regress being an old man on political party identification, income, and religion using a logit model. The glm call would be something like this:

oldman.reg <- glm(formula = old.man ~ partyid + income + relign8, family = binomial(link = "logit"), 
    data = oldmendat)

When we store this regression in an object, we get access to several items of interest

objects(oldman.reg)
##  [1] "aic"               "boundary"          "call"             
##  [4] "coefficients"      "contrasts"         "control"          
##  [7] "converged"         "data"              "deviance"         
## [10] "df.null"           "df.residual"       "effects"          
## [13] "family"            "fitted.values"     "formula"          
## [16] "iter"              "linear.predictors" "method"           
## [19] "model"             "na.action"         "null.deviance"    
## [22] "offset"            "prior.weights"     "qr"               
## [25] "R"                 "rank"              "residuals"        
## [28] "terms"             "weights"           "xlevels"          
## [31] "y"
oldman.reg$coefficients
##            (Intercept)      partyidrepublican     partyidindependent 
##                -2.0321                 0.0907                 0.1398 
##  partyidsomething else  income$15,000-$29,999  income$30,000-$49,999 
##                -0.7690                -0.1353                -0.5616 
##  income$50,000-$74,999  income$75,000-$99,999        relign8catholic 
##                -0.8756                -0.8373                -0.5462 
##      relign8mormon/lds relign8other christian          relign8jewish 
##                -0.5397                -1.8787                 0.1290 
##          relign8muslim  relign8something else            relign8none 
##                -1.6732                -1.7342                -1.2106
oldman.reg$df.residual
## [1] 62949
oldman.reg$aic
## [1] 23042

summary.glm()

But often times, we just want a nice summary. R has a series of “summary” functions for certain object classes (“glm”, in this case) that automatically create nice overviews of the regression results from all the information contained in a fitted object.

sum.oldman.reg <- summary(oldman.reg)

print(sum.oldman.reg)
## 
## Call:
## glm(formula = old.man ~ partyid + income + relign8, family = binomial(link = "logit"), 
##     data = oldmendat)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.563  -0.376  -0.267  -0.193   3.239  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -2.0321     0.0463  -43.86  < 2e-16 ***
## partyidrepublican        0.0907     0.0449    2.02  0.04319 *  
## partyidindependent       0.1398     0.0500    2.80  0.00512 ** 
## partyidsomething else   -0.7690     0.1420   -5.42  6.1e-08 ***
## income$15,000-$29,999   -0.1353     0.0499   -2.71  0.00668 ** 
## income$30,000-$49,999   -0.5616     0.0543  -10.35  < 2e-16 ***
## income$50,000-$74,999   -0.8756     0.0552  -15.85  < 2e-16 ***
## income$75,000-$99,999   -0.8373     0.1207   -6.93  4.1e-12 ***
## relign8catholic         -0.5462     0.0461  -11.85  < 2e-16 ***
## relign8mormon/lds       -0.5397     0.1066   -5.06  4.1e-07 ***
## relign8other christian  -1.8787     0.0745  -25.21  < 2e-16 ***
## relign8jewish            0.1290     0.1071    1.20  0.22856    
## relign8muslim           -1.6732     0.4527   -3.70  0.00022 ***
## relign8something else   -1.7342     0.1249  -13.88  < 2e-16 ***
## relign8none             -1.2106     0.0794  -15.25  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24542  on 62963  degrees of freedom
## Residual deviance: 23012  on 62949  degrees of freedom
##   (11324 observations deleted due to missingness)
## AIC: 23042
## 
## Number of Fisher Scoring iterations: 7

Can also extract useful things from the summary object (like a matrix of coefficient estimates…)

objects(sum.oldman.reg)
##  [1] "aic"            "aliased"        "call"           "coefficients"  
##  [5] "contrasts"      "cov.scaled"     "cov.unscaled"   "deviance"      
##  [9] "deviance.resid" "df"             "df.null"        "df.residual"   
## [13] "dispersion"     "family"         "iter"           "na.action"     
## [17] "null.deviance"  "terms"
coef <- sum.oldman.reg$coefficients
print(coef)
##                        Estimate Std. Error z value   Pr(>|z|)
## (Intercept)             -2.0321    0.04634 -43.855  0.000e+00
## partyidrepublican        0.0907    0.04486   2.022  4.319e-02
## partyidindependent       0.1398    0.04995   2.800  5.118e-03
## partyidsomething else   -0.7690    0.14196  -5.417  6.068e-08
## income$15,000-$29,999   -0.1353    0.04988  -2.712  6.684e-03
## income$30,000-$49,999   -0.5616    0.05428 -10.347  4.326e-25
## income$50,000-$74,999   -0.8756    0.05524 -15.851  1.391e-56
## income$75,000-$99,999   -0.8373    0.12075  -6.935  4.074e-12
## relign8catholic         -0.5462    0.04610 -11.848  2.198e-32
## relign8mormon/lds       -0.5397    0.10661  -5.062  4.146e-07
## relign8other christian  -1.8787    0.07451 -25.215 2.746e-140
## relign8jewish            0.1290    0.10711   1.204  2.286e-01
## relign8muslim           -1.6732    0.45273  -3.696  2.192e-04
## relign8something else   -1.7342    0.12494 -13.881  8.284e-44
## relign8none             -1.2106    0.07940 -15.248  1.693e-52

Factors with glm

Note that, in our results, R has broken up our variables into their different factor levels (as it will do whenever your regressors have factor levels)

If your data aren't factorized, you can tell glm to factorize a variable (i.e. create dummy variables on the fly) by writing

glm(formula = y ~ x1 + x2 + factor(x3), family = family(link = "link"), data = data)

This is really helpful for doing fixed effects (although you may need to use either the package plm, or de-mean the data in advance, if you want two-way fixed effects – and careful with those standard errors)

Interactions with glm

x1:x2 interacts all terms in x1 with all terms in x2

summary(glm(formula = old.man ~ partyid:relign8, family = binomial(link = "logit"), 
    data = oldmendat))
## 
## Call:
## glm(formula = old.man ~ partyid:relign8, family = binomial(link = "logit"), 
##     data = oldmendat)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.474  -0.416  -0.321  -0.190   3.018  
## 
## Coefficients: (1 not defined because of singularities)
##                                               Estimate Std. Error z value
## (Intercept)                                   -4.38670    0.35575  -12.33
## partyiddemocrat:relign8protestant              2.05179    0.35847    5.72
## partyidrepublican:relign8protestant            1.98342    0.35764    5.55
## partyidindependent:relign8protestant           2.12719    0.35953    5.92
## partyidsomething else:relign8protestant        1.26875    0.41027    3.09
## partyiddemocrat:relign8catholic                1.44507    0.36035    4.01
## partyidrepublican:relign8catholic              1.54119    0.36084    4.27
## partyidindependent:relign8catholic             1.50967    0.36320    4.16
## partyidsomething else:relign8catholic          0.70361    0.45322    1.55
## partyiddemocrat:relign8mormon/lds              1.83103    0.41190    4.45
## partyidrepublican:relign8mormon/lds            1.37272    0.38074    3.61
## partyidindependent:relign8mormon/lds           1.75105    0.42141    4.16
## partyidsomething else:relign8mormon/lds        0.84574    0.80065    1.06
## partyiddemocrat:relign8other christian         0.37790    0.36992    1.02
## partyidrepublican:relign8other christian       0.16285    0.37294    0.44
## partyidindependent:relign8other christian      0.40099    0.38279    1.05
## partyidsomething else:relign8other christian   0.07055    0.47738    0.15
## partyiddemocrat:relign8jewish                  1.92954    0.37784    5.11
## partyidrepublican:relign8jewish                2.25951    0.41859    5.40
## partyidindependent:relign8jewish               2.19307    0.40300    5.44
## partyidsomething else:relign8jewish          -10.17936  130.15390   -0.08
## partyiddemocrat:relign8muslim                  0.05597    0.79570    0.07
## partyidrepublican:relign8muslim                1.77174    0.69590    2.55
## partyidindependent:relign8muslim               0.00468    1.06727    0.00
## partyidsomething else:relign8muslim          -10.17936  202.51553   -0.05
## partyiddemocrat:relign8something else          0.24819    0.39916    0.62
## partyidrepublican:relign8something else        0.86476    0.42194    2.05
## partyidindependent:relign8something else       0.55546    0.40917    1.36
## partyidsomething else:relign8something else   -0.15659    0.61580   -0.25
## partyiddemocrat:relign8none                    0.79135    0.37254    2.12
## partyidrepublican:relign8none                  1.09171    0.38746    2.82
## partyidindependent:relign8none                 0.97190    0.37511    2.59
## partyidsomething else:relign8none                   NA         NA      NA
##                                              Pr(>|z|)    
## (Intercept)                                   < 2e-16 ***
## partyiddemocrat:relign8protestant             1.0e-08 ***
## partyidrepublican:relign8protestant           2.9e-08 ***
## partyidindependent:relign8protestant          3.3e-09 ***
## partyidsomething else:relign8protestant       0.00199 ** 
## partyiddemocrat:relign8catholic               6.1e-05 ***
## partyidrepublican:relign8catholic             1.9e-05 ***
## partyidindependent:relign8catholic            3.2e-05 ***
## partyidsomething else:relign8catholic         0.12055    
## partyiddemocrat:relign8mormon/lds             8.8e-06 ***
## partyidrepublican:relign8mormon/lds           0.00031 ***
## partyidindependent:relign8mormon/lds          3.2e-05 ***
## partyidsomething else:relign8mormon/lds       0.29082    
## partyiddemocrat:relign8other christian        0.30698    
## partyidrepublican:relign8other christian      0.66236    
## partyidindependent:relign8other christian     0.29484    
## partyidsomething else:relign8other christian  0.88251    
## partyiddemocrat:relign8jewish                 3.3e-07 ***
## partyidrepublican:relign8jewish               6.7e-08 ***
## partyidindependent:relign8jewish              5.3e-08 ***
## partyidsomething else:relign8jewish           0.93766    
## partyiddemocrat:relign8muslim                 0.94392    
## partyidrepublican:relign8muslim               0.01090 *  
## partyidindependent:relign8muslim              0.99650    
## partyidsomething else:relign8muslim           0.95991    
## partyiddemocrat:relign8something else         0.53409    
## partyidrepublican:relign8something else       0.04041 *  
## partyidindependent:relign8something else      0.17462    
## partyidsomething else:relign8something else   0.79927    
## partyiddemocrat:relign8none                   0.03365 *  
## partyidrepublican:relign8none                 0.00484 ** 
## partyidindependent:relign8none                0.00957 ** 
## partyidsomething else:relign8none                  NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26783  on 66822  degrees of freedom
## Residual deviance: 25533  on 66791  degrees of freedom
##   (7465 observations deleted due to missingness)
## AIC: 25597
## 
## Number of Fisher Scoring iterations: 13

Interactions with glm

x1*x2 produces the cross of x1 and x2, or x1+x2+x1:x2

summary(glm(formula = old.man ~ partyid * relign8, family = binomial(link = "logit"), 
    data = oldmendat))
## 
## Call:
## glm(formula = old.man ~ partyid * relign8, family = binomial(link = "logit"), 
##     data = oldmendat)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.474  -0.416  -0.321  -0.190   3.018  
## 
## Coefficients:
##                                               Estimate Std. Error z value
## (Intercept)                                   -2.33492    0.04414  -52.90
## partyidrepublican                             -0.06837    0.05744   -1.19
## partyidindependent                             0.07540    0.06823    1.11
## partyidsomething else                         -0.78303    0.20909   -3.74
## relign8catholic                               -0.60671    0.07245   -8.37
## relign8mormon/lds                             -0.22076    0.21226   -1.04
## relign8other christian                        -1.67388    0.11060  -15.13
## relign8jewish                                 -0.12225    0.13473   -0.91
## relign8muslim                                 -1.99582    0.71311   -2.80
## relign8something else                         -1.80360    0.18633   -9.68
## relign8none                                   -1.26044    0.11908  -10.58
## partyidrepublican:relign8catholic              0.16449    0.10123    1.62
## partyidindependent:relign8catholic            -0.01080    0.11540   -0.09
## partyidsomething else:relign8catholic          0.04157    0.35479    0.12
## partyidrepublican:relign8mormon/lds           -0.38994    0.25458   -1.53
## partyidindependent:relign8mormon/lds          -0.15538    0.31431   -0.49
## partyidsomething else:relign8mormon/lds       -0.20225    0.77545   -0.26
## partyidrepublican:relign8other christian      -0.14668    0.16159   -0.91
## partyidindependent:relign8other christian     -0.05231    0.18685   -0.28
## partyidsomething else:relign8other christian   0.47568    0.39413    1.21
## partyidrepublican:relign8jewish                0.39834    0.26108    1.53
## partyidindependent:relign8jewish               0.18813    0.23815    0.79
## partyidsomething else:relign8jewish          -11.32587  130.15364   -0.09
## partyidrepublican:relign8muslim                1.78415    0.93145    1.92
## partyidindependent:relign8muslim              -0.12670    1.23440   -0.10
## partyidsomething else:relign8muslim           -9.45230  202.51658   -0.05
## partyidrepublican:relign8something else        0.68495    0.29589    2.31
## partyidindependent:relign8something else       0.23187    0.27981    0.83
## partyidsomething else:relign8something else    0.37825    0.57372    0.66
## partyidrepublican:relign8none                  0.36873    0.19774    1.86
## partyidindependent:relign8none                 0.10515    0.17617    0.60
## partyidsomething else:relign8none             -0.00832    0.42721   -0.02
##                                              Pr(>|z|)    
## (Intercept)                                   < 2e-16 ***
## partyidrepublican                             0.23393    
## partyidindependent                            0.26909    
## partyidsomething else                         0.00018 ***
## relign8catholic                               < 2e-16 ***
## relign8mormon/lds                             0.29832    
## relign8other christian                        < 2e-16 ***
## relign8jewish                                 0.36423    
## relign8muslim                                 0.00513 ** 
## relign8something else                         < 2e-16 ***
## relign8none                                   < 2e-16 ***
## partyidrepublican:relign8catholic             0.10419    
## partyidindependent:relign8catholic            0.92543    
## partyidsomething else:relign8catholic         0.90673    
## partyidrepublican:relign8mormon/lds           0.12560    
## partyidindependent:relign8mormon/lds          0.62107    
## partyidsomething else:relign8mormon/lds       0.79423    
## partyidrepublican:relign8other christian      0.36400    
## partyidindependent:relign8other christian     0.77950    
## partyidsomething else:relign8other christian  0.22746    
## partyidrepublican:relign8jewish               0.12708    
## partyidindependent:relign8jewish              0.42956    
## partyidsomething else:relign8jewish           0.93066    
## partyidrepublican:relign8muslim               0.05544 .  
## partyidindependent:relign8muslim              0.91825    
## partyidsomething else:relign8muslim           0.96277    
## partyidrepublican:relign8something else       0.02062 *  
## partyidindependent:relign8something else      0.40730    
## partyidsomething else:relign8something else   0.50970    
## partyidrepublican:relign8none                 0.06222 .  
## partyidindependent:relign8none                0.55060    
## partyidsomething else:relign8none             0.98447    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26783  on 66822  degrees of freedom
## Residual deviance: 25533  on 66791  degrees of freedom
##   (7465 observations deleted due to missingness)
## AIC: 25597
## 
## Number of Fisher Scoring iterations: 13

Regression Diagnostics

The package lmtest has most of what you'll need to run basic regression diagnostics.

Breusch-Pagan Test for Heteroscedasticity

bptest(oldman.reg)
## 
##  studentized Breusch-Pagan test
## 
## data:  oldman.reg
## BP = 1497, df = 14, p-value < 2.2e-16

Breusch-Pagan Test for Heteroscedasticity

bptest(oldman.reg)
## 
##  studentized Breusch-Pagan test
## 
## data:  oldman.reg
## BP = 1497, df = 14, p-value < 2.2e-16

Also have tests for autocorrelation of disturbances (Durwin-Watson), higher-order serial correlation (Breusch-Godfrey)

Can also estimate heteroskedasticity/autocorrelation consistent standard errors via coeftest and the sandwich package

coeftest(x = oldman.reg, vcov. = vcovHC)
## 
## z test of coefficients:
## 
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -2.0321     0.0474  -42.86  < 2e-16 ***
## partyidrepublican        0.0907     0.0451    2.01  0.04440 *  
## partyidindependent       0.1398     0.0498    2.81  0.00499 ** 
## partyidsomething else   -0.7690     0.1425   -5.40  6.8e-08 ***
## income$15,000-$29,999   -0.1353     0.0497   -2.72  0.00649 ** 
## income$30,000-$49,999   -0.5616     0.0542  -10.35  < 2e-16 ***
## income$50,000-$74,999   -0.8756     0.0554  -15.81  < 2e-16 ***
## income$75,000-$99,999   -0.8373     0.1204   -6.95  3.6e-12 ***
## relign8catholic         -0.5462     0.0462  -11.81  < 2e-16 ***
## relign8mormon/lds       -0.5397     0.1072   -5.04  4.8e-07 ***
## relign8other christian  -1.8787     0.0744  -25.26  < 2e-16 ***
## relign8jewish            0.1290     0.1073    1.20  0.22938    
## relign8muslim           -1.6732     0.4559   -3.67  0.00024 ***
## relign8something else   -1.7342     0.1258  -13.79  < 2e-16 ***
## relign8none             -1.2106     0.0802  -15.09  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For panel data, the plm package also allows you to compute panel-corrected (Beck-Katz) standard errors

Breakout and overnight homework

Consider the voting preference data. Fit a logistic regression to the data for California (state number 5), modeling preference for Bush (pres04=2) vs. Kerry (pres04=1) (removing the other candidates) as a function of income, potentially including additional covariates such as sex, race and age. What do you find in terms of how income associates with voting preference?

How do you predict the probability of voting for Kerry for a given set of covariate values? Consider the predict.glm() function and what its help page says. Or write code that converts from the model coefficients to the probability scale. Compare the predicted probability of voting for Kerry for a white male of high income with a white male of low income. Do the same for white females.

If you're feeling ambitious…

Using the tools for stratified analyses we have seen today, fit separate models of voting preference as a function of (personal) income and sex (and potentially additional covariates) for each state (or a collection of states of interest to you). How do the effect of income and sex vary by state? The file stateIncome.txt is a tab-delimited file of mean income by state. See if the effects of income and sex are correlated with the state-level mean income.

For our purposes here, don't worry much about the uncertainty in the estimates you get in the logistic regression modeling, but of course in a real analysis you would need to do this.

Note that the translation from the state numbers to names can be obtained by looking at: numToName <- data.frame(name = row.names(state.x77), num = 1:50).

Breakout Answers!