August 2013, UC Berkeley
Chris Krogslund (ckrogslund@berkeley.edu)
R is great at doing useful stuff and making it very, very easy
Replication data from Gelman et al., "Rich State, Poor State, Red State, Blue State: What’s the Matter with Connecticut?", Quarterly Journal of Political Science, 2007, 2: 345-367
For decades, the Democrats have been viewed as the party of the poor, with the Republicans representing the rich. Recent presidential elections, however, have shown a reverse pattern, with Democrats performing well in the richer blue states in the northeast and coasts, and Republicans dominating in the red states in the middle of the country and the south. Through multilevel modeling of individual-level survey data and county- and state-level demographic and electoral data, we reconcile these patterns. Furthermore, we find that income matters more in red America than in blue America. In poor states, rich people are much more likely than poor people to vote for the Republican presidential candidate, but in rich states (such as Connecticut), income has a very low correlation with vote preference.
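The data frame mydata used throughout comes from the Stata-format replication file; here's a minimal sketch of reading it in with the foreign package (the file name below is a placeholder, not the actual replication file name):

```r
library(foreign)
# Read the Stata replication file; value labels become factor levels, as seen in str() below
mydata <- read.dta(file = "gelman_qjps_replication.dta")
```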
Let's see what we're dealing with here.
str(mydata)
## 'data.frame': 76205 obs. of 17 variables:
## $ state : int 2 2 2 2 2 2 2 2 2 2 ...
## $ pres04 : int 1 2 1 1 1 1 1 2 2 2 ...
## $ sex : Factor w/ 2 levels "male","female": 2 1 2 2 2 2 1 2 2 2 ...
## $ race : Factor w/ 5 levels "white","black",..: 1 1 2 2 1 1 1 1 1 1 ...
## $ age9 : Factor w/ 9 levels "18-24","25-29",..: 2 1 3 3 4 3 4 1 2 1 ...
## $ partyid : Factor w/ 4 levels "democrat","republican",..: NA NA NA NA NA NA NA NA NA NA ...
## $ income : Factor w/ 8 levels "under $15,000",..: NA NA NA NA NA NA NA NA NA NA ...
## $ relign8 : Factor w/ 8 levels "protestant","catholic",..: NA NA NA NA NA NA NA NA NA NA ...
## $ age60 : Factor w/ 4 levels "18-29","30-44",..: 1 1 2 2 2 2 2 1 1 1 ...
## $ age65 : Factor w/ 6 levels "18-24","25-29",..: 2 1 3 3 4 3 4 1 2 1 ...
## $ geocode : int 3 3 3 3 3 3 3 3 3 3 ...
## $ sizeplac: Factor w/ 5 levels "city over 500,000",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ brnagain: Factor w/ 2 levels "yes","no": NA NA NA NA NA NA NA NA NA NA ...
## $ attend : Factor w/ 6 levels "more than once a week",..: NA NA NA NA NA NA NA NA NA NA ...
## $ year : num 2004 2004 2004 2004 2004 ...
## $ region : num 4 4 4 4 4 4 4 4 4 4 ...
## $ y : num 0 1 0 0 0 0 0 1 1 1 ...
## - attr(*, "datalabel")= chr ""
## - attr(*, "time.stamp")= chr " 6 Jun 2007 14:53"
## - attr(*, "formats")= chr "%8.0g" "%8.0g" "%8.0g" "%8.0g" ...
## - attr(*, "types")= int 251 251 251 251 251 251 251 251 251 251 ...
## - attr(*, "val.labels")= chr "stanum" "presak04" "sex" "race" ...
## - attr(*, "var.labels")= chr "state id" "in today's election for president, did you just vote for:" "are you:" "are you:" ...
## - attr(*, "version")= int 8
## - attr(*, "label.table")=List of 14
## ..$ stanum : Named num 2
## .. ..- attr(*, "names")= chr "alaska"
## ..$ presak04: Named num 0 1 2 3 9
## .. ..- attr(*, "names")= chr "did not vote" "kerry" "bush" "nader" ...
## ..$ sex : Named num 1 2
## .. ..- attr(*, "names")= chr "male" "female"
## ..$ race : Named num 1 2 3 4 5
## .. ..- attr(*, "names")= chr "white" "black" "hispanic/latino" "asian" ...
## ..$ age9 : Named num 1 2 3 4 5 6 7 8 9
## .. ..- attr(*, "names")= chr "18-24" "25-29" "30-39" "40-44" ...
## ..$ partyid : Named num 1 2 3 4
## .. ..- attr(*, "names")= chr "democrat" "republican" "independent" "something else"
## ..$ income : Named num 1 2 3 4 5 6 7 8
## .. ..- attr(*, "names")= chr "under $15,000" "$15,000-$29,999" "$30,000-$49,999" "$50,000-$74,999" ...
## ..$ relign8 : Named num 1 2 3 4 5 6 7 8
## .. ..- attr(*, "names")= chr "protestant" "catholic" "mormon/lds" "other christian" ...
## ..$ age60 : Named num 1 2 3 4
## .. ..- attr(*, "names")= chr "18-29" "30-44" "45-59" "60 or over"
## ..$ age65 : Named num 1 2 3 4 5 6
## .. ..- attr(*, "names")= chr "18-24" "25-29" "30-39" "40-49" ...
## ..$ geocode : Named num 1 2 3
## .. ..- attr(*, "names")= chr "juneau/fairbanks/rural" "anchorage" "anchorage-fairbanks corridor"
## ..$ sizeplac: Named num 1 2 3 4 5
## .. ..- attr(*, "names")= chr "city over 500,000" "city: 50,000 to 500,000" "suburbs" "city: 10,000 to 49,999" ...
## ..$ brnagain: Named num 1 2
## .. ..- attr(*, "names")= chr "yes" "no"
## ..$ attend : Named num 1 2 3 4 5 9
## .. ..- attr(*, "names")= chr "more than once a week" "once a week" "a few times a month" "a few times a year" ...
This is a bit messy, but it's a decently comprehensive overview.
Take a look at the variable "age9"
unique(mydata$age9)
## [1] 25-29 18-24 30-39 40-44 45-49 65-74
## [7] 75 or over 50-59 <NA> 60-64
## Levels: 18-24 25-29 30-39 40-44 45-49 50-59 60-64 65-74 75 or over
Binned Age | True Age
---------- | --------
1 | 18-24
2 | 25-29
3 | 30-39
4 | 40-44
5 | 45-49
6 | 50-59
7 | 60-64
8 | 65-74
9 | 75+
Let's look at the mean, standard deviation, median, and quintiles for "age9"
mean(x = as.numeric(mydata$age9), na.rm = T)
## [1] 4.544
sd(x = as.numeric(mydata$age9), na.rm = T)
## [1] 2.238
median(x = as.numeric(mydata$age9), na.rm = T)
## [1] 4
quantile(x = as.numeric(mydata$age9), na.rm = T, probs = seq(from = 0, to = 1,
by = 0.2))
## 0% 20% 40% 60% 80% 100%
## 1 3 4 5 6 9
We could do the same thing for "income"
unique(mydata$income)
## [1] <NA> $30,000-$49,999 $15,000-$29,999 $50,000-$74,999
## [5] under $15,000 $75,000-$99,999
## 8 Levels: under $15,000 $15,000-$29,999 ... $200,000 or more
Binned Income | True Income
------------- | -----------
1 | $0-$14,999
2 | $15,000-$29,999
3 | $30,000-$49,999
4 | $50,000-$74,999
5 | $75,000-$99,999
6 | $100,000-$149,999
7 | $150,000-$199,999
8 | $200,000 or more
mean(x = as.numeric(mydata$income), na.rm = T)
## [1] 2.637
sd(x = as.numeric(mydata$income), na.rm = T)
## [1] 1.206
median(x = as.numeric(mydata$income), na.rm = T)
## [1] 3
quantile(x = as.numeric(mydata$income), na.rm = T, probs = seq(from = 0, to = 1,
by = 0.2))
## 0% 20% 40% 60% 80% 100%
## 1 1 2 3 4 5
Gender balance?
gender <- mydata$sex
gender <- gender[!is.na(gender)]
unique(gender)
## [1] female male
## Levels: male female
males <- length(gender[gender == "male"])/length(gender) * 100
cat(males, "%", sep = "")
## 45.13%
females <- length(gender[gender == "female"])/length(gender) * 100
cat(females, "%", sep = "")
## 54.87%
Political balance?
partyid <- mydata$partyid
partyid <- partyid[!is.na(partyid)]
unique(partyid)
## [1] republican democrat independent something else
## Levels: democrat republican independent something else
dem <- length(partyid[partyid == "democrat"])/length(partyid) * 100
cat(dem, "%", sep = "")
## 37.11%
rep <- length(partyid[partyid == "republican"])/length(partyid) * 100
cat(rep, "%", sep = "")
## 35.57%
ind <- length(partyid[partyid == "independent"])/length(partyid) * 100
cat(ind, "%", sep = "")
## 22.64%
other <- length(partyid[partyid == "something else"])/length(partyid) * 100
cat(other, "%", sep = "")
## 4.675%
Racial balance?
race <- mydata$race
race <- race[!is.na(race)]
unique(race)
## [1] white black other asian
## [5] hispanic/latino
## Levels: white black hispanic/latino asian other
white <- length(race[race == "white"])/length(race) * 100
cat(white, "%", sep = "")
## 80.79%
black <- length(race[race == "black"])/length(race) * 100
cat(black, "%", sep = "")
## 9.558%
hispanic <- length(race[race == "hispanic/latino"])/length(race) * 100
cat(hispanic, "%", sep = "")
## 5.909%
asian <- length(race[race == "asian"])/length(race) * 100
cat(asian, "%", sep = "")
## 1.371%
other <- length(race[race == "other"])/length(race) * 100
cat(other, "%", sep = "")
## 2.376%
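By the way, the same percentage breakdowns can be computed more compactly with table() and prop.table(); a quick sketch (table() drops NAs by default, so the results match up to rounding):

```r
# Percent of non-missing respondents in each category
prop.table(table(mydata$sex)) * 100
prop.table(table(mydata$partyid)) * 100
prop.table(table(mydata$race)) * 100
```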
All techniques for this kind of group-wise computation rely on the split-apply-combine strategy
First, take the data (or some object) and split it into smaller datasets on the basis of some variable
Dataset A
x | y | z
--|---|--
1 | 1 | 1
2 | 2 | 1
3 | 3 | 1
4 | 1 | 2
5 | 2 | 2
6 | 3 | 2
Datasets B and C (Dataset A split according to "z")
Dataset B:

x | y | z
--|---|--
1 | 1 | 1
2 | 2 | 1
3 | 3 | 1

Dataset C:

x | y | z
--|---|--
4 | 1 | 2
5 | 2 | 2
6 | 3 | 2
Second, apply some function to each one of the smaller datasets/objects
Example function: mean of variables "x" and "y"
Datasets B' and C'
Dataset B':

mean(x) | mean(y) | z
------- | ------- | -
2 | 2 | 1

Dataset C':

mean(x) | mean(y) | z
------- | ------- | -
5 | 2 | 2
Third, combine the results into a larger dataset/object
Datasets B' and C'
Dataset B':

mean(x) | mean(y) | z
------- | ------- | -
2 | 2 | 1

Dataset C':

mean(x) | mean(y) | z
------- | ------- | -
5 | 2 | 2
Dataset A'
mean(x) | mean(y) | z
------- | ------- | -
2 | 2 | 1
5 | 2 | 2
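Here's a quick base-R sketch of exactly this toy example (the object names are arbitrary); the three steps map directly onto split(), lapply(), and rbind():

```r
# Dataset A from the tables above
A <- data.frame(x = 1:6, y = c(1, 2, 3, 1, 2, 3), z = rep(1:2, each = 3))
# Split on z...
pieces <- split(A, A$z)
# ...apply the mean of x and y to each piece...
summaries <- lapply(pieces, function(piece) {
    data.frame(mean.x = mean(piece$x), mean.y = mean(piece$y), z = unique(piece$z))
})
# ...and combine the results back into one data frame (Dataset A')
A.prime <- do.call(rbind, summaries)
print(A.prime)
```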
Split
white.sub <- mydata[which(mydata$race == "white" & is.na(mydata$race) == F), ]
black.sub <- mydata[which(mydata$race == "black" & is.na(mydata$race) == F), ]
hispanic.sub <- mydata[which(mydata$race == "hispanic/latino" & is.na(mydata$race) == F), ]
Apply
white.pres04 <- white.sub$pres04
white.pres04 <- white.pres04[is.element(white.pres04, c(1:3, 9)) == T]
white.k <- length(white.pres04[white.pres04 == 1])/length(white.pres04) * 100
white.b <- length(white.pres04[white.pres04 == 2])/length(white.pres04) * 100
white.n <- length(white.pres04[white.pres04 == 3])/length(white.pres04) * 100
white.o <- length(white.pres04[white.pres04 == 9])/length(white.pres04) * 100
black.pres04 <- black.sub$pres04
black.pres04 <- black.pres04[is.element(black.pres04, c(1:3, 9)) == T]
black.k <- length(black.pres04[black.pres04 == 1])/length(black.pres04) * 100
black.b <- length(black.pres04[black.pres04 == 2])/length(black.pres04) * 100
black.n <- length(black.pres04[black.pres04 == 3])/length(black.pres04) * 100
black.o <- length(black.pres04[black.pres04 == 9])/length(black.pres04) * 100
. . .
Combine
percent <- as.vector(c(white.k, white.b, white.n, white.o, black.k, black.b,
black.n, black.o))
vote <- rep(c("kerry", "bush", "nader", "other"), 2)
race <- c(rep("white", 4), rep("black", 4))
newdata <- data.frame(race, vote, percent)
print(newdata)
## race vote percent
## 1 white kerry 43.6000
## 2 white bush 55.0246
## 3 white nader 0.6944
## 4 white other 0.6810
## 5 black kerry 87.2812
## 6 black bush 11.6318
## 7 black nader 0.7905
## 8 black other 0.2964
. . .
Split
races = split(mydata$pres04, mydata$race)
Apply/Combine
results <- matrix(NA, nrow = length(races), ncol = length(c(1:3, 9)))
rownames(results) <- names(races)  # names() keeps the splits in factor-level order; objects() would sort them alphabetically and mislabel the rows
colnames(results) <- c("kerry", "bush", "nader", "other")
print(results)
## kerry bush nader other
## white              NA   NA    NA    NA
## black              NA   NA    NA    NA
## hispanic/latino    NA   NA    NA    NA
## asian              NA   NA    NA    NA
## other              NA   NA    NA    NA
for (i in 1:length(races)) {
race.subset <- races[[i]]
race.subset <- race.subset[is.element(race.subset, c(1:3, 9)) == T]
results[i, "kerry"] <- length(race.subset[race.subset == 1])/length(race.subset) *
100
results[i, "bush"] <- length(race.subset[race.subset == 2])/length(race.subset) *
100
results[i, "nader"] <- length(race.subset[race.subset == 3])/length(race.subset) *
100
results[i, "other"] <- length(race.subset[race.subset == 9])/length(race.subset) *
100
}
print(results)
## kerry bush nader other
## white           43.60 55.02 0.6944 0.6810
## black           87.28 11.63 0.7905 0.2964
## hispanic/latino 60.75 37.36 0.7750 1.1169
## asian           61.43 37.68 0.6869 0.1963
## other           59.44 37.46 1.2622 1.8359
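The plyr-based version below assumes the package has been installed and loaded:

```r
# install.packages("plyr")  # once, if not already installed
library(plyr)
```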
cleandat <- mydata[which(is.element(mydata$pres04, c(1:3, 9)) == T & is.na(mydata$race) ==
F), ]
vote.by.race <- ddply(.data = cleandat, .variables = .(race), summarize, kerry = length(pres04[pres04 ==
1])/length(pres04) * 100, bush = length(pres04[pres04 == 2])/length(pres04) *
100, nader = length(pres04[pres04 == 3])/length(pres04) * 100, other = length(pres04[pres04 ==
9])/length(pres04) * 100)
print(vote.by.race)
## race kerry bush nader other
## 1 white 43.60 55.02 0.6944 0.6810
## 2 black 87.28 11.63 0.7905 0.2964
## 3 hispanic/latino 60.75 37.36 0.7750 1.1169
## 4 asian 61.43 37.68 0.6869 0.1963
## 5 other 59.44 37.46 1.2622 1.8359
plyr (Hadley Wickham, Rice) is the go-to package for all your splitting-applying-combining needs
Among its many benefits over base R, plyr offers a family of functions with a consistent naming scheme built around two questions:
What is the class of your input object?
What is the class of your desired output object?
If you want to split a data frame, and return results as a data frame, you use ddply
If you want to split a data frame, and return results as a list, you use dlply
If you want to split a list, and return results as a data frame, you use ldply
If you want to split a list, and return results as a list, you use llply (and so on for the other input/output combinations)
All of the major plyr functions have the same basic syntax
xxply(.data = , .variables = .(, , , ...), .fun = )
Consider the previous example using ddply:
vote.by.race <- ddply(.data = cleandat, .variables = .(race), summarize, kerry = length(pres04[pres04 ==
1])/length(pres04) * 100, bush = length(pres04[pres04 == 2])/length(pres04) *
100, nader = length(pres04[pres04 == 3])/length(pres04) * 100, other = length(pres04[pres04 ==
9])/length(pres04) * 100)
vote.by.race.dl <- dlply(.data = cleandat, .variables = .(race), summarize,
kerry = length(pres04[pres04 == 1])/length(pres04) * 100, bush = length(pres04[pres04 ==
2])/length(pres04) * 100, nader = length(pres04[pres04 == 3])/length(pres04) *
100, other = length(pres04[pres04 == 9])/length(pres04) * 100)
print(vote.by.race.dl)
## $white
## kerry bush nader other
## 1 43.6 55.02 0.6944 0.681
##
## $black
## kerry bush nader other
## 1 87.28 11.63 0.7905 0.2964
##
## $`hispanic/latino`
## kerry bush nader other
## 1 60.75 37.36 0.775 1.117
##
## $asian
## kerry bush nader other
## 1 61.43 37.68 0.6869 0.1963
##
## $other
## kerry bush nader other
## 1 59.44 37.46 1.262 1.836
##
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
## race
## 1 white
## 2 black
## 3 hispanic/latino
## 4 asian
## 5 other
races = split(cleandat, cleandat$race)
objects(races)
## [1] "asian" "black" "hispanic/latino" "other"
## [5] "white"
vote.by.race.ld <- ldply(.data = races, summarize, kerry = length(pres04[pres04 ==
1])/length(pres04) * 100, bush = length(pres04[pres04 == 2])/length(pres04) *
100, nader = length(pres04[pres04 == 3])/length(pres04) * 100, other = length(pres04[pres04 ==
9])/length(pres04) * 100)
print(vote.by.race.ld)
## .id kerry bush nader other
## 1 white 43.60 55.02 0.6944 0.6810
## 2 black 87.28 11.63 0.7905 0.2964
## 3 hispanic/latino 60.75 37.36 0.7750 1.1169
## 4 asian 61.43 37.68 0.6869 0.1963
## 5 other 59.44 37.46 1.2622 1.8359
vote.by.race.ll <- llply(.data = races, summarize, kerry = length(pres04[pres04 ==
1])/length(pres04) * 100, bush = length(pres04[pres04 == 2])/length(pres04) *
100, nader = length(pres04[pres04 == 3])/length(pres04) * 100, other = length(pres04[pres04 ==
9])/length(pres04) * 100)
print(vote.by.race.ll)
## $white
## kerry bush nader other
## 1 43.6 55.02 0.6944 0.681
##
## $black
## kerry bush nader other
## 1 87.28 11.63 0.7905 0.2964
##
## $`hispanic/latino`
## kerry bush nader other
## 1 60.75 37.36 0.775 1.117
##
## $asian
## kerry bush nader other
## 1 61.43 37.68 0.6869 0.1963
##
## $other
## kerry bush nader other
## 1 59.44 37.46 1.262 1.836
transform: applies a function to a data frame and returns the altered version of that data frame
olddat <- ddply(.data = cleandat, .variables = .(race), transform, old = ifelse(age65 ==
"65 or over", 1, 0))
print(olddat[20:40, c("sex", "race", "age65", "old")])
## sex race age65 old
## 20 male white 30-39 0
## 21 male white 40-49 0
## 22 female white 40-49 0
## 23 male white 30-39 0
## 24 male white 40-49 0
## 25 male white 65 or over 1
## 26 female white 30-39 0
## 27 male white 25-29 0
## 28 male white 30-39 0
## 29 female white 40-49 0
## 30 male white 50-64 0
## 31 male white 50-64 0
## 32 female white 40-49 0
## 33 female white 40-49 0
## 34 male white 40-49 0
## 35 female white 50-64 0
## 36 male white 40-49 0
## 37 female white 50-64 0
## 38 male white 65 or over 1
## 39 male white 18-24 0
## 40 female white 30-39 0
Note that transform can't do transformations that involve the results of other transformations from the same call
oldmendat <- ddply(.data = cleandat, .variables = .(race), transform, old = ifelse(age65 ==
"65 or over", 1, 0), old.man = ifelse(old == 1 & sex == "male", 1, 0))
## Error: object 'old' not found
For this, we need...
mutate: just like transform, but it executes the transformations sequentially, so later ones can rely on the results of earlier ones from the same call
oldmendat <- ddply(.data = cleandat, .variables = .(race), mutate, old = ifelse(age65 ==
"65 or over", 1, 0), old.man = ifelse(old == 1 & sex == "male", 1, 0))
print(oldmendat[40:80, c("sex", "race", "age65", "old", "old.man")])
## sex race age65 old old.man
## 40 female white 30-39 0 0
## 41 female white 50-64 0 0
## 42 male white 30-39 0 0
## 43 female white 30-39 0 0
## 44 female white 40-49 0 0
## 45 male white 30-39 0 0
## 46 female white 30-39 0 0
## 47 male white 50-64 0 0
## 48 female white 25-29 0 0
## 49 female white 40-49 0 0
## 50 female white 30-39 0 0
## 51 male white 25-29 0 0
## 52 male white 30-39 0 0
## 53 female white 40-49 0 0
## 54 male white 30-39 0 0
## 55 male white 30-39 0 0
## 56 male white 40-49 0 0
## 57 male white 40-49 0 0
## 58 male white 18-24 0 0
## 59 female white 40-49 0 0
## 60 female white 65 or over 1 0
## 61 female white 50-64 0 0
## 62 female white 30-39 0 0
## 63 male white 50-64 0 0
## 64 male white 40-49 0 0
## 65 male white 65 or over 1 1
## 66 male white 30-39 0 0
## 67 female white 50-64 0 0
## 68 female white 65 or over 1 0
## 69 male white 30-39 0 0
## 70 female white 50-64 0 0
## 71 female white 30-39 0 0
## 72 male white 18-24 0 0
## 73 female white 50-64 0 0
## 74 female white 25-29 0 0
## 75 male white 40-49 0 0
## 76 male white 18-24 0 0
## 77 male white 40-49 0 0
## 78 female white 40-49 0 0
## 79 female white 65 or over 1 0
## 80 male white 40-49 0 0
arrange: orders a data frame on the basis of column contents
snippet <- oldmendat[1:50, c("sex", "race", "age65", "old", "old.man")]
print(snippet)
## sex race age65 old old.man
## 1 female white 25-29 0 0
## 2 male white 18-24 0 0
## 3 female white 40-49 0 0
## 4 female white 30-39 0 0
## 5 male white 40-49 0 0
## 6 female white 18-24 0 0
## 7 female white 25-29 0 0
## 8 female white 18-24 0 0
## 9 female white 25-29 0 0
## 10 female white 40-49 0 0
## 11 female white 40-49 0 0
## 12 male white 40-49 0 0
## 13 male white 30-39 0 0
## 14 male white 40-49 0 0
## 15 male white 25-29 0 0
## 16 female white 25-29 0 0
## 17 female white 30-39 0 0
## 18 male white 25-29 0 0
## 19 female white 25-29 0 0
## 20 male white 30-39 0 0
## 21 male white 40-49 0 0
## 22 female white 40-49 0 0
## 23 male white 30-39 0 0
## 24 male white 40-49 0 0
## 25 male white 65 or over 1 1
## 26 female white 30-39 0 0
## 27 male white 25-29 0 0
## 28 male white 30-39 0 0
## 29 female white 40-49 0 0
## 30 male white 50-64 0 0
## 31 male white 50-64 0 0
## 32 female white 40-49 0 0
## 33 female white 40-49 0 0
## 34 male white 40-49 0 0
## 35 female white 50-64 0 0
## 36 male white 40-49 0 0
## 37 female white 50-64 0 0
## 38 male white 65 or over 1 1
## 39 male white 18-24 0 0
## 40 female white 30-39 0 0
## 41 female white 50-64 0 0
## 42 male white 30-39 0 0
## 43 female white 30-39 0 0
## 44 female white 40-49 0 0
## 45 male white 30-39 0 0
## 46 female white 30-39 0 0
## 47 male white 50-64 0 0
## 48 female white 25-29 0 0
## 49 female white 40-49 0 0
## 50 female white 30-39 0 0
arrange(df = snippet, sex, desc(age65))
## sex race age65 old old.man
## 1 male white 65 or over 1 1
## 2 male white 65 or over 1 1
## 3 male white 50-64 0 0
## 4 male white 50-64 0 0
## 5 male white 50-64 0 0
## 6 male white 40-49 0 0
## 7 male white 40-49 0 0
## 8 male white 40-49 0 0
## 9 male white 40-49 0 0
## 10 male white 40-49 0 0
## 11 male white 40-49 0 0
## 12 male white 40-49 0 0
## 13 male white 30-39 0 0
## 14 male white 30-39 0 0
## 15 male white 30-39 0 0
## 16 male white 30-39 0 0
## 17 male white 30-39 0 0
## 18 male white 30-39 0 0
## 19 male white 25-29 0 0
## 20 male white 25-29 0 0
## 21 male white 25-29 0 0
## 22 male white 18-24 0 0
## 23 male white 18-24 0 0
## 24 female white 50-64 0 0
## 25 female white 50-64 0 0
## 26 female white 50-64 0 0
## 27 female white 40-49 0 0
## 28 female white 40-49 0 0
## 29 female white 40-49 0 0
## 30 female white 40-49 0 0
## 31 female white 40-49 0 0
## 32 female white 40-49 0 0
## 33 female white 40-49 0 0
## 34 female white 40-49 0 0
## 35 female white 40-49 0 0
## 36 female white 30-39 0 0
## 37 female white 30-39 0 0
## 38 female white 30-39 0 0
## 39 female white 30-39 0 0
## 40 female white 30-39 0 0
## 41 female white 30-39 0 0
## 42 female white 30-39 0 0
## 43 female white 25-29 0 0
## 44 female white 25-29 0 0
## 45 female white 25-29 0 0
## 46 female white 25-29 0 0
## 47 female white 25-29 0 0
## 48 female white 25-29 0 0
## 49 female white 18-24 0 0
## 50 female white 18-24 0 0
arrange(df = snippet, desc(sex), age65)
## sex race age65 old old.man
## 1 female white 18-24 0 0
## 2 female white 18-24 0 0
## 3 female white 25-29 0 0
## 4 female white 25-29 0 0
## 5 female white 25-29 0 0
## 6 female white 25-29 0 0
## 7 female white 25-29 0 0
## 8 female white 25-29 0 0
## 9 female white 30-39 0 0
## 10 female white 30-39 0 0
## 11 female white 30-39 0 0
## 12 female white 30-39 0 0
## 13 female white 30-39 0 0
## 14 female white 30-39 0 0
## 15 female white 30-39 0 0
## 16 female white 40-49 0 0
## 17 female white 40-49 0 0
## 18 female white 40-49 0 0
## 19 female white 40-49 0 0
## 20 female white 40-49 0 0
## 21 female white 40-49 0 0
## 22 female white 40-49 0 0
## 23 female white 40-49 0 0
## 24 female white 40-49 0 0
## 25 female white 50-64 0 0
## 26 female white 50-64 0 0
## 27 female white 50-64 0 0
## 28 male white 18-24 0 0
## 29 male white 18-24 0 0
## 30 male white 25-29 0 0
## 31 male white 25-29 0 0
## 32 male white 25-29 0 0
## 33 male white 30-39 0 0
## 34 male white 30-39 0 0
## 35 male white 30-39 0 0
## 36 male white 30-39 0 0
## 37 male white 30-39 0 0
## 38 male white 30-39 0 0
## 39 male white 40-49 0 0
## 40 male white 40-49 0 0
## 41 male white 40-49 0 0
## 42 male white 40-49 0 0
## 43 male white 40-49 0 0
## 44 male white 40-49 0 0
## 45 male white 40-49 0 0
## 46 male white 50-64 0 0
## 47 male white 50-64 0 0
## 48 male white 50-64 0 0
## 49 male white 65 or over 1 1
## 50 male white 65 or over 1 1
Any function will work; just remember that when you write function(x), the x you receive is one slice of your input object, with a single value of your chosen splitting variable. Each slice is the same class as your original object, so any function you could apply to the original object can also be applied to its subsets (this may be important in the very near future, hint hint...)
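For example, here's a quick sketch of handing ddply a custom function; the function name and the summary it computes are just illustrations, not part of the original analysis:

```r
# Each piece passed to the function is a data frame for one race
mean.age <- function(piece) {
    data.frame(mean.binned.age = mean(as.numeric(piece$age9), na.rm = TRUE))
}
age.by.race <- ddply(.data = cleandat, .variables = .(race), .fun = mean.age)
print(age.by.race)
```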
Oftentimes, even before we're interested in doing all this group-wise stuff, we need to reshape our data
For instance, say you download a huge dataset from the World Bank, IMF, OECD, etc.
There's a good chance your data will look something like this:
## country var_1995 var_1996 var_1997 var_1998 var_1999 var_2000 var_2001
## 1 Australia 79.87 78.58 99.02 80.71 51.56 38.850 46.85
## 2 Austria 17.29 59.17 23.12 73.41 22.06 3.348 42.14
## 3 Belgium 59.12 62.54 23.11 91.34 71.09 10.812 63.65
## 4 Canada 28.37 84.69 6.49 73.54 74.89 85.872 90.31
## 5 Denmark 80.23 21.18 45.45 21.14 98.16 65.903 77.76
## 6 France 39.23 44.18 89.71 50.10 91.92 69.633 50.06
If you've ever tried to make a time series graph of any kind, you know this can be really annoying
But fear not...
Once again, Hadley Wickham has made all our lives easier, this time with reshape2
Though base R does have commands for reshaping data (aggregate, by, tapply, etc.), their input formats differ slightly and each is suited only to specific reshaping tasks
reshape2 really only has two commands, and their functions are in their names: melt and cast
Basic idea is to take your data frame and melt it into three essential components
Case identifiers (e.g.: an individual, a country)
Characteristic variables (e.g.: age, race, gender)
Variable values (e.g.: {White, Black, Hispanic/Latino, Asian, Other} for race)
In the example from before, our only case identifier is country, and our only characteristic variable is var (with observations from 1995-2001)
## country var_1995 var_1996 var_1997 var_1998 var_1999 var_2000 var_2001
## 1 Australia 79.87 78.58 99.02 80.71 51.56 38.850 46.85
## 2 Austria 17.29 59.17 23.12 73.41 22.06 3.348 42.14
## 3 Belgium 59.12 62.54 23.11 91.34 71.09 10.812 63.65
## 4 Canada 28.37 84.69 6.49 73.54 74.89 85.872 90.31
## 5 Denmark 80.23 21.18 45.45 21.14 98.16 65.903 77.76
## 6 France 39.23 44.18 89.71 50.10 91.92 69.633 50.06
The generic call for melt looks like this:
melt(data = , id.vars = c(" ", " ", " ", ...))
Note that you can also customize the names of the columns that store the variable names and values (via melt's variable.name and value.name arguments)
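If you want to follow along, a data frame shaped like the wb example above can be cooked up from random draws (the values will differ from those printed earlier; the object names are arbitrary):

```r
library(reshape2)
set.seed(1)
countries <- c("Australia", "Austria", "Belgium", "Canada", "Denmark", "France")
# Six countries by seven yearly columns named var_1995 ... var_2001
wb <- data.frame(country = countries,
                 matrix(runif(6 * 7, min = 0, max = 100), nrow = 6,
                        dimnames = list(NULL, paste("var", 1995:2001, sep = "_"))))
```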
We can melt the previous example to a nice time series by the call
wb.melted <- melt(data = wb, id.vars = "country")
print(wb.melted)
## country variable value
## 1 Australia var_1995 79.871
## 2 Austria var_1995 17.289
## 3 Belgium var_1995 59.116
## 4 Canada var_1995 28.370
## 5 Denmark var_1995 80.233
## 6 France var_1995 39.233
## 7 Australia var_1996 78.579
## 8 Austria var_1996 59.174
## 9 Belgium var_1996 62.541
## 10 Canada var_1996 84.695
## 11 Denmark var_1996 21.177
## 12 France var_1996 44.178
## 13 Australia var_1997 99.024
## 14 Austria var_1997 23.122
## 15 Belgium var_1997 23.114
## 16 Canada var_1997 6.490
## 17 Denmark var_1997 45.448
## 18 France var_1997 89.707
## 19 Australia var_1998 80.709
## 20 Austria var_1998 73.408
## 21 Belgium var_1998 91.341
## 22 Canada var_1998 73.542
## 23 Denmark var_1998 21.139
## 24 France var_1998 50.097
## 25 Australia var_1999 51.560
## 26 Austria var_1999 22.057
## 27 Belgium var_1999 71.092
## 28 Canada var_1999 74.895
## 29 Denmark var_1999 98.161
## 30 France var_1999 91.917
## 31 Australia var_2000 38.850
## 32 Austria var_2000 3.348
## 33 Belgium var_2000 10.812
## 34 Canada var_2000 85.872
## 35 Denmark var_2000 65.903
## 36 France var_2000 69.633
## 37 Australia var_2001 46.845
## 38 Austria var_2001 42.143
## 39 Belgium var_2001 63.649
## 40 Canada var_2001 90.307
## 41 Denmark var_2001 77.762
## 42 France var_2001 50.064
wb.melted$variable <- as.numeric(gsub(pattern = "var_", replacement = "", x = wb.melted$variable))
print(wb.melted)
## country variable value
## 1 Australia 1995 79.871
## 2 Austria 1995 17.289
## 3 Belgium 1995 59.116
## 4 Canada 1995 28.370
## 5 Denmark 1995 80.233
## 6 France 1995 39.233
## 7 Australia 1996 78.579
## 8 Austria 1996 59.174
## 9 Belgium 1996 62.541
## 10 Canada 1996 84.695
## 11 Denmark 1996 21.177
## 12 France 1996 44.178
## 13 Australia 1997 99.024
## 14 Austria 1997 23.122
## 15 Belgium 1997 23.114
## 16 Canada 1997 6.490
## 17 Denmark 1997 45.448
## 18 France 1997 89.707
## 19 Australia 1998 80.709
## 20 Austria 1998 73.408
## 21 Belgium 1998 91.341
## 22 Canada 1998 73.542
## 23 Denmark 1998 21.139
## 24 France 1998 50.097
## 25 Australia 1999 51.560
## 26 Austria 1999 22.057
## 27 Belgium 1999 71.092
## 28 Canada 1999 74.895
## 29 Denmark 1999 98.161
## 30 France 1999 91.917
## 31 Australia 2000 38.850
## 32 Austria 2000 3.348
## 33 Belgium 2000 10.812
## 34 Canada 2000 85.872
## 35 Denmark 2000 65.903
## 36 France 2000 69.633
## 37 Australia 2001 46.845
## 38 Austria 2001 42.143
## 39 Belgium 2001 63.649
## 40 Canada 2001 90.307
## 41 Denmark 2001 77.762
## 42 France 2001 50.064
This comes in really handy when we want to make, say, panel data graphs (though random variables aren't that great to look at)
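For example, a line-per-country time series plot of the melted data is a single ggplot2 call; a quick sketch (assuming ggplot2 is installed):

```r
library(ggplot2)
# One line per country: year on the x-axis, the melted value on the y-axis
ggplot(data = wb.melted, aes(x = variable, y = value, colour = country)) + geom_line()
```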
Sometimes, once we've melted down the data, we want to recast it
There are two main functions for this: acast (for producing arrays) and dcast (for producing data frames)
The generic call for (d)cast looks like this:
dcast(data = , formula = x_var1 + x_var2 ~ y_var1 + y_var2, fun.aggregate = )
Note that data frames cannot be produced with more than two dimensions
We can then recast our melted data from the previous example back into its original format
wb.melted$variable <- paste("var_", wb.melted$variable, sep = "")
wb.recast <- dcast(data = wb.melted, formula = country ~ variable)
print(wb.recast)
## country var_1995 var_1996 var_1997 var_1998 var_1999 var_2000 var_2001
## 1 Australia 79.87 78.58 99.02 80.71 51.56 38.850 46.85
## 2 Austria 17.29 59.17 23.12 73.41 22.06 3.348 42.14
## 3 Belgium 59.12 62.54 23.11 91.34 71.09 10.812 63.65
## 4 Canada 28.37 84.69 6.49 73.54 74.89 85.872 90.31
## 5 Denmark 80.23 21.18 45.45 21.14 98.16 65.903 77.76
## 6 France 39.23 44.18 89.71 50.10 91.92 69.633 50.06
Running regressions in R is extremely simple and very straightforward (though doing things with standard errors requires a little extra work)
Most basic, catch-all regression function in R is glm
glm fits a generalized linear model with your choice of family/link function (gaussian, logit, poisson, etc.)
lm is just a standard linear regression (equivalent to glm with family=gaussian(link="identity"))
The basic glm call looks something like this:
glm(formula = y ~ x1 + x2 + x3 + ..., family = familyname(link = "linkname"),
data = )
There are a whole bunch of families and links to use (help(family) for a full list), but some essentials are binomial(link = "logit"), gaussian(link = "identity"), and poisson(link = "log")
Suppose we want to regress being an old man on political party identification, income, and religion using a logit model. The glm call would be something like this:
oldman.reg <- glm(formula = old.man ~ partyid + income + relign8, family = binomial(link = "logit"),
data = oldmendat)
When we store this regression in an object, we get access to several items of interest
objects(oldman.reg)
## [1] "aic" "boundary" "call"
## [4] "coefficients" "contrasts" "control"
## [7] "converged" "data" "deviance"
## [10] "df.null" "df.residual" "effects"
## [13] "family" "fitted.values" "formula"
## [16] "iter" "linear.predictors" "method"
## [19] "model" "na.action" "null.deviance"
## [22] "offset" "prior.weights" "qr"
## [25] "R" "rank" "residuals"
## [28] "terms" "weights" "xlevels"
## [31] "y"
oldman.reg$coefficients
## (Intercept) partyidrepublican partyidindependent
## -2.0321 0.0907 0.1398
## partyidsomething else income$15,000-$29,999 income$30,000-$49,999
## -0.7690 -0.1353 -0.5616
## income$50,000-$74,999 income$75,000-$99,999 relign8catholic
## -0.8756 -0.8373 -0.5462
## relign8mormon/lds relign8other christian relign8jewish
## -0.5397 -1.8787 0.1290
## relign8muslim relign8something else relign8none
## -1.6732 -1.7342 -1.2106
oldman.reg$df.residual
## [1] 62949
oldman.reg$aic
## [1] 23042
But oftentimes we just want a nice summary. R has a series of "summary" functions for certain object classes ("glm", in this case) that automatically create nice overviews of the regression results from all the information contained in a fitted object.
sum.oldman.reg <- summary(oldman.reg)
print(sum.oldman.reg)
##
## Call:
## glm(formula = old.man ~ partyid + income + relign8, family = binomial(link = "logit"),
## data = oldmendat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.563 -0.376 -0.267 -0.193 3.239
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.0321 0.0463 -43.86 < 2e-16 ***
## partyidrepublican 0.0907 0.0449 2.02 0.04319 *
## partyidindependent 0.1398 0.0500 2.80 0.00512 **
## partyidsomething else -0.7690 0.1420 -5.42 6.1e-08 ***
## income$15,000-$29,999 -0.1353 0.0499 -2.71 0.00668 **
## income$30,000-$49,999 -0.5616 0.0543 -10.35 < 2e-16 ***
## income$50,000-$74,999 -0.8756 0.0552 -15.85 < 2e-16 ***
## income$75,000-$99,999 -0.8373 0.1207 -6.93 4.1e-12 ***
## relign8catholic -0.5462 0.0461 -11.85 < 2e-16 ***
## relign8mormon/lds -0.5397 0.1066 -5.06 4.1e-07 ***
## relign8other christian -1.8787 0.0745 -25.21 < 2e-16 ***
## relign8jewish 0.1290 0.1071 1.20 0.22856
## relign8muslim -1.6732 0.4527 -3.70 0.00022 ***
## relign8something else -1.7342 0.1249 -13.88 < 2e-16 ***
## relign8none -1.2106 0.0794 -15.25 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24542 on 62963 degrees of freedom
## Residual deviance: 23012 on 62949 degrees of freedom
## (11324 observations deleted due to missingness)
## AIC: 23042
##
## Number of Fisher Scoring iterations: 7
Can also extract useful things from the summary object (like a matrix of coefficient estimates...)
objects(sum.oldman.reg)
## [1] "aic" "aliased" "call" "coefficients"
## [5] "contrasts" "cov.scaled" "cov.unscaled" "deviance"
## [9] "deviance.resid" "df" "df.null" "df.residual"
## [13] "dispersion" "family" "iter" "na.action"
## [17] "null.deviance" "terms"
coef <- sum.oldman.reg$coefficients
print(coef)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.0321 0.04634 -43.855 0.000e+00
## partyidrepublican 0.0907 0.04486 2.022 4.319e-02
## partyidindependent 0.1398 0.04995 2.800 5.118e-03
## partyidsomething else -0.7690 0.14196 -5.417 6.068e-08
## income$15,000-$29,999 -0.1353 0.04988 -2.712 6.684e-03
## income$30,000-$49,999 -0.5616 0.05428 -10.347 4.326e-25
## income$50,000-$74,999 -0.8756 0.05524 -15.851 1.391e-56
## income$75,000-$99,999 -0.8373 0.12075 -6.935 4.074e-12
## relign8catholic -0.5462 0.04610 -11.848 2.198e-32
## relign8mormon/lds -0.5397 0.10661 -5.062 4.146e-07
## relign8other christian -1.8787 0.07451 -25.215 2.746e-140
## relign8jewish 0.1290 0.10711 1.204 2.286e-01
## relign8muslim -1.6732 0.45273 -3.696 2.192e-04
## relign8something else -1.7342 0.12494 -13.881 8.284e-44
## relign8none -1.2106 0.07940 -15.248 1.693e-52
Note that, in our results, R has expanded each factor variable into separate terms for its levels, relative to an omitted baseline (as it will do whenever your regressors are factors)
If your data aren't factorized, you can tell glm to factorize a variable (i.e. create dummy variables on the fly) by writing
glm(formula = y ~ x1 + x2 + factor(x3), family = family(link = "link"), data = data)
This is really helpful for doing fixed effects (although you may need to use either the package plm, or de-mean the data in advance, if you want two-way fixed effects -- and be careful with those standard errors)
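For reference, a two-way fixed-effects fit in plm looks roughly like this; the data frame, index variables, and regressors here are hypothetical placeholders, not the exit-poll data:

```r
library(plm)
# Hypothetical panel data with unit ("country") and time ("year") identifiers
fe.mod <- plm(formula = y ~ x1 + x2, data = paneldat, index = c("country", "year"),
              model = "within", effect = "twoways")
summary(fe.mod)
```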
x1:x2 interacts all terms in x1 with all terms in x2
summary(glm(formula = old.man ~ partyid:relign8, family = binomial(link = "logit"),
data = oldmendat))
##
## Call:
## glm(formula = old.man ~ partyid:relign8, family = binomial(link = "logit"),
## data = oldmendat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.474 -0.416 -0.321 -0.190 3.018
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value
## (Intercept) -4.38670 0.35575 -12.33
## partyiddemocrat:relign8protestant 2.05179 0.35847 5.72
## partyidrepublican:relign8protestant 1.98342 0.35764 5.55
## partyidindependent:relign8protestant 2.12719 0.35953 5.92
## partyidsomething else:relign8protestant 1.26875 0.41027 3.09
## partyiddemocrat:relign8catholic 1.44507 0.36035 4.01
## partyidrepublican:relign8catholic 1.54119 0.36084 4.27
## partyidindependent:relign8catholic 1.50967 0.36320 4.16
## partyidsomething else:relign8catholic 0.70361 0.45322 1.55
## partyiddemocrat:relign8mormon/lds 1.83103 0.41190 4.45
## partyidrepublican:relign8mormon/lds 1.37272 0.38074 3.61
## partyidindependent:relign8mormon/lds 1.75105 0.42141 4.16
## partyidsomething else:relign8mormon/lds 0.84574 0.80065 1.06
## partyiddemocrat:relign8other christian 0.37790 0.36992 1.02
## partyidrepublican:relign8other christian 0.16285 0.37294 0.44
## partyidindependent:relign8other christian 0.40099 0.38279 1.05
## partyidsomething else:relign8other christian 0.07055 0.47738 0.15
## partyiddemocrat:relign8jewish 1.92954 0.37784 5.11
## partyidrepublican:relign8jewish 2.25951 0.41859 5.40
## partyidindependent:relign8jewish 2.19307 0.40300 5.44
## partyidsomething else:relign8jewish -10.17936 130.15390 -0.08
## partyiddemocrat:relign8muslim 0.05597 0.79570 0.07
## partyidrepublican:relign8muslim 1.77174 0.69590 2.55
## partyidindependent:relign8muslim 0.00468 1.06727 0.00
## partyidsomething else:relign8muslim -10.17936 202.51553 -0.05
## partyiddemocrat:relign8something else 0.24819 0.39916 0.62
## partyidrepublican:relign8something else 0.86476 0.42194 2.05
## partyidindependent:relign8something else 0.55546 0.40917 1.36
## partyidsomething else:relign8something else -0.15659 0.61580 -0.25
## partyiddemocrat:relign8none 0.79135 0.37254 2.12
## partyidrepublican:relign8none 1.09171 0.38746 2.82
## partyidindependent:relign8none 0.97190 0.37511 2.59
## partyidsomething else:relign8none NA NA NA
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## partyiddemocrat:relign8protestant 1.0e-08 ***
## partyidrepublican:relign8protestant 2.9e-08 ***
## partyidindependent:relign8protestant 3.3e-09 ***
## partyidsomething else:relign8protestant 0.00199 **
## partyiddemocrat:relign8catholic 6.1e-05 ***
## partyidrepublican:relign8catholic 1.9e-05 ***
## partyidindependent:relign8catholic 3.2e-05 ***
## partyidsomething else:relign8catholic 0.12055
## partyiddemocrat:relign8mormon/lds 8.8e-06 ***
## partyidrepublican:relign8mormon/lds 0.00031 ***
## partyidindependent:relign8mormon/lds 3.2e-05 ***
## partyidsomething else:relign8mormon/lds 0.29082
## partyiddemocrat:relign8other christian 0.30698
## partyidrepublican:relign8other christian 0.66236
## partyidindependent:relign8other christian 0.29484
## partyidsomething else:relign8other christian 0.88251
## partyiddemocrat:relign8jewish 3.3e-07 ***
## partyidrepublican:relign8jewish 6.7e-08 ***
## partyidindependent:relign8jewish 5.3e-08 ***
## partyidsomething else:relign8jewish 0.93766
## partyiddemocrat:relign8muslim 0.94392
## partyidrepublican:relign8muslim 0.01090 *
## partyidindependent:relign8muslim 0.99650
## partyidsomething else:relign8muslim 0.95991
## partyiddemocrat:relign8something else 0.53409
## partyidrepublican:relign8something else 0.04041 *
## partyidindependent:relign8something else 0.17462
## partyidsomething else:relign8something else 0.79927
## partyiddemocrat:relign8none 0.03365 *
## partyidrepublican:relign8none 0.00484 **
## partyidindependent:relign8none 0.00957 **
## partyidsomething else:relign8none NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26783 on 66822 degrees of freedom
## Residual deviance: 25533 on 66791 degrees of freedom
## (7465 observations deleted due to missingness)
## AIC: 25597
##
## Number of Fisher Scoring iterations: 13
x1*x2 produces the cross of x1 and x2, or x1+x2+x1:x2
summary(glm(formula = old.man ~ partyid * relign8, family = binomial(link = "logit"),
data = oldmendat))
##
## Call:
## glm(formula = old.man ~ partyid * relign8, family = binomial(link = "logit"),
## data = oldmendat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.474 -0.416 -0.321 -0.190 3.018
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -2.33492 0.04414 -52.90
## partyidrepublican -0.06837 0.05744 -1.19
## partyidindependent 0.07540 0.06823 1.11
## partyidsomething else -0.78303 0.20909 -3.74
## relign8catholic -0.60671 0.07245 -8.37
## relign8mormon/lds -0.22076 0.21226 -1.04
## relign8other christian -1.67388 0.11060 -15.13
## relign8jewish -0.12225 0.13473 -0.91
## relign8muslim -1.99582 0.71311 -2.80
## relign8something else -1.80360 0.18633 -9.68
## relign8none -1.26044 0.11908 -10.58
## partyidrepublican:relign8catholic 0.16449 0.10123 1.62
## partyidindependent:relign8catholic -0.01080 0.11540 -0.09
## partyidsomething else:relign8catholic 0.04157 0.35479 0.12
## partyidrepublican:relign8mormon/lds -0.38994 0.25458 -1.53
## partyidindependent:relign8mormon/lds -0.15538 0.31431 -0.49
## partyidsomething else:relign8mormon/lds -0.20225 0.77545 -0.26
## partyidrepublican:relign8other christian -0.14668 0.16159 -0.91
## partyidindependent:relign8other christian -0.05231 0.18685 -0.28
## partyidsomething else:relign8other christian 0.47568 0.39413 1.21
## partyidrepublican:relign8jewish 0.39834 0.26108 1.53
## partyidindependent:relign8jewish 0.18813 0.23815 0.79
## partyidsomething else:relign8jewish -11.32587 130.15364 -0.09
## partyidrepublican:relign8muslim 1.78415 0.93145 1.92
## partyidindependent:relign8muslim -0.12670 1.23440 -0.10
## partyidsomething else:relign8muslim -9.45230 202.51658 -0.05
## partyidrepublican:relign8something else 0.68495 0.29589 2.31
## partyidindependent:relign8something else 0.23187 0.27981 0.83
## partyidsomething else:relign8something else 0.37825 0.57372 0.66
## partyidrepublican:relign8none 0.36873 0.19774 1.86
## partyidindependent:relign8none 0.10515 0.17617 0.60
## partyidsomething else:relign8none -0.00832 0.42721 -0.02
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## partyidrepublican 0.23393
## partyidindependent 0.26909
## partyidsomething else 0.00018 ***
## relign8catholic < 2e-16 ***
## relign8mormon/lds 0.29832
## relign8other christian < 2e-16 ***
## relign8jewish 0.36423
## relign8muslim 0.00513 **
## relign8something else < 2e-16 ***
## relign8none < 2e-16 ***
## partyidrepublican:relign8catholic 0.10419
## partyidindependent:relign8catholic 0.92543
## partyidsomething else:relign8catholic 0.90673
## partyidrepublican:relign8mormon/lds 0.12560
## partyidindependent:relign8mormon/lds 0.62107
## partyidsomething else:relign8mormon/lds 0.79423
## partyidrepublican:relign8other christian 0.36400
## partyidindependent:relign8other christian 0.77950
## partyidsomething else:relign8other christian 0.22746
## partyidrepublican:relign8jewish 0.12708
## partyidindependent:relign8jewish 0.42956
## partyidsomething else:relign8jewish 0.93066
## partyidrepublican:relign8muslim 0.05544 .
## partyidindependent:relign8muslim 0.91825
## partyidsomething else:relign8muslim 0.96277
## partyidrepublican:relign8something else 0.02062 *
## partyidindependent:relign8something else 0.40730
## partyidsomething else:relign8something else 0.50970
## partyidrepublican:relign8none 0.06222 .
## partyidindependent:relign8none 0.55060
## partyidsomething else:relign8none 0.98447
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 26783 on 66822 degrees of freedom
## Residual deviance: 25533 on 66791 degrees of freedom
## (7465 observations deleted due to missingness)
## AIC: 25597
##
## Number of Fisher Scoring iterations: 13
The package lmtest has most of what you'll need to run basic regression diagnostics.
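Load it (along with sandwich, used further below for robust standard errors) before running the tests:

```r
# install.packages(c("lmtest", "sandwich"))  # once, if not already installed
library(lmtest)
library(sandwich)
```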
Breusch-Pagan Test for Heteroscedasticity
bptest(oldman.reg)
##
## studentized Breusch-Pagan test
##
## data: oldman.reg
## BP = 1497, df = 14, p-value < 2.2e-16
Also have tests for autocorrelation of disturbances (Durbin-Watson) and higher-order serial correlation (Breusch-Godfrey)
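With lmtest loaded, both are one-liners; here's a sketch on a hypothetical time-series regression (tsdat, y, and x are placeholders -- these tests presume a meaningful ordering of the observations):

```r
# Hypothetical linear model fit to ordered (time-series) data
ts.mod <- lm(formula = y ~ x, data = tsdat)
dwtest(ts.mod)             # Durbin-Watson: first-order autocorrelation
bgtest(ts.mod, order = 3)  # Breusch-Godfrey: serial correlation up to order 3
```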
Can also estimate heteroskedasticity/autocorrelation consistent standard errors via coeftest and the sandwich package
coeftest(x = oldman.reg, vcov. = vcovHC)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.0321 0.0474 -42.86 < 2e-16 ***
## partyidrepublican 0.0907 0.0451 2.01 0.04440 *
## partyidindependent 0.1398 0.0498 2.81 0.00499 **
## partyidsomething else -0.7690 0.1425 -5.40 6.8e-08 ***
## income$15,000-$29,999 -0.1353 0.0497 -2.72 0.00649 **
## income$30,000-$49,999 -0.5616 0.0542 -10.35 < 2e-16 ***
## income$50,000-$74,999 -0.8756 0.0554 -15.81 < 2e-16 ***
## income$75,000-$99,999 -0.8373 0.1204 -6.95 3.6e-12 ***
## relign8catholic -0.5462 0.0462 -11.81 < 2e-16 ***
## relign8mormon/lds -0.5397 0.1072 -5.04 4.8e-07 ***
## relign8other christian -1.8787 0.0744 -25.26 < 2e-16 ***
## relign8jewish 0.1290 0.1073 1.20 0.22938
## relign8muslim -1.6732 0.4559 -3.67 0.00024 ***
## relign8something else -1.7342 0.1258 -13.79 < 2e-16 ***
## relign8none -1.2106 0.0802 -15.09 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For panel data, the plm package also allows you to compute panel-corrected (Beck-Katz) standard errors
Consider the voting preference data. Fit a logistic regression to the data for California (state number 5), modeling preference for Bush (pres04=2) vs. Kerry (pres04=1) (removing the other candidates) as a function of income, potentially including additional covariates such as sex, race and age. What do you find in terms of how income associates with voting preference?
How do you predict the probability of voting for Kerry for a given set of covariate values? Consider the predict.glm()
function and what its help page says. Or write code that converts from the model coefficients to the probability scale. Compare the predicted probability of voting for Kerry for a white male of high income with a white male of low income. Do the same for white females.
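One possible set of mechanics for the prediction step; the model specification and covariate profiles below are placeholders to illustrate predict.glm(), not the exercise answer:

```r
# Keep California respondents (state 5) who voted for Kerry (1) or Bush (2)
ca <- cleandat[cleandat$state == 5 & cleandat$pres04 %in% c(1, 2), ]
ca <- droplevels(ca)  # drop factor levels with no observations in this subset
ca$kerry <- ifelse(ca$pres04 == 1, 1, 0)
ca.reg <- glm(formula = kerry ~ income + sex + race, family = binomial(link = "logit"),
              data = ca)
# Predicted probability of a Kerry vote for low- and high-income white males
newobs <- data.frame(income = c("under $15,000", "$75,000-$99,999"),
                     sex = "male", race = "white")
predict(object = ca.reg, newdata = newobs, type = "response")
```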
If you're feeling ambitious...
Using the tools for stratified analyses we have seen today, fit separate models of voting preference as a function of (personal) income and sex (and potentially additional covariates) for each state (or a collection of states of interest to you). How do the effect of income and sex vary by state? The file stateIncome.txt is a tab-delimited file of mean income by state. See if the effects of income and sex are correlated with the state-level mean income.
For our purposes here, don't worry much about the uncertainty in the estimates you get in the logistic regression modeling, but of course in a real analysis you would need to do this.
Note that the translation from the state numbers to names can be obtained by looking at: numToName <- data.frame(name = row.names(state.x77), num = 1:50)