STAT 152

Frequently Asked Questions

FORT EVANS PROJECT

Is each record in the dataset one family?
No, each record represents one person. If there are 3 people in a family then there will be three records in the file, one for each. There is no family identifier in the file, so families cannot be exactly matched up.
Do I take an SRS of 1,000 from the whole population?
No, you are to take a stratified cluster sample. To do this, you take a sample from each of the regions in the city separately. You need to figure out how many persons you will sample from each region. The total sample size should be 1,000.
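One way to split the 1,000 between regions is proportional allocation. The sketch below assumes two regions, and the population counts in it are made up for illustration; use the totals from the project description instead.

```r
# Hypothetical population counts per region (NOT the Fort Evans totals).
pop <- c(region1 = 30000, region2 = 20000)
n_total <- 1000

# Proportional allocation: each region's share of the sample
# equals its share of the population.
n_h <- round(n_total * pop / sum(pop))

# Rounding can leave the total slightly off; put any difference
# in the largest region so the sizes still add to n_total.
largest <- which.max(pop)
n_h[largest] <- n_h[largest] + (n_total - sum(n_h))
n_h
```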
How do I take a cluster sample when the data set is not in clusters?
Look at the random digit dialing example in your text. It shows how an SRS of phone numbers can be used to sample banks of 100 phone numbers, where the probability of selecting a bank is proportional to the number of residential phones in it.
What if I wind up with two people from the same family?
This should be a rare event -- ignore it.
How do I rake when the race codes in the data set do not match the categories supplied (white non-Hispanic, etc)?
You will need to create these categories by using two variables, race and ethnicity. For example, white non-Hispanics are those with race = 1 and ethnicity = 8, and Hispanics are those with an ethnicity code between 1 and 7. The other category should be the remainder (you can include the NAs and DKs in this group).
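A minimal sketch of the recoding on a few toy records. The codes race = 1, ethnicity = 8, and ethnicity 1 to 7 are the ones given above; treating race = 2 as Black is an assumption made only for this example, so check the codebook. The name race4 is also just illustrative.

```r
# Toy records. race = 2 meaning Black is an ASSUMPTION; check the codebook.
race      <- c(1, 1, 2, 3, NA, 2)
ethnicity <- c(8, 3, 8, 8, NA, 5)

# Start everyone in "Other" (the remainder, including NA and DK),
# then overwrite the groups that can be identified.
race4 <- rep("Other", length(race))
hisp  <- !is.na(ethnicity) & ethnicity >= 1 & ethnicity <= 7
nonh  <- !is.na(ethnicity) & ethnicity == 8
race4[hisp] <- "Hispanic"
race4[nonh & !is.na(race) & race == 1] <- "White non-Hispanic"
race4[nonh & !is.na(race) & race == 2] <- "Black non-Hispanic"
table(race4)
```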
When I rake, do I need to weight the data according to the probability of picking a family?
No, the raking should be done at the individual person level. Make a table of the counts in your sample of the 8 categories: Male White non-Hispanic, Female White non-Hispanic, ..., Male Other, and Female Other. Use this table to find the weight-class weights using the raking methods. (The population totals are in the Fort Evans project description.)
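The raking step can be sketched as follows, assuming the population totals are given as separate margins for sex and for race. All counts and totals below are invented for illustration, and the object names (samp, pop_sex, pop_race, fit, r) are hypothetical.

```r
# Sample counts: rows = sex, columns = race group (made-up numbers).
samp <- matrix(c(200, 60, 50, 40,
                 220, 70, 55, 45),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("Male", "Female"),
                               c("WhiteNH", "BlackNH", "Hispanic", "Other")))

# Hypothetical population margins; the real ones are in the project description.
pop_sex  <- c(Male = 24000, Female = 26000)
pop_race <- c(WhiteNH = 30000, BlackNH = 9000, Hispanic = 7000, Other = 4000)

# Alternately scale rows and columns until both margins match.
fit <- samp
for (iter in 1:100) {
  fit <- sweep(fit, 1, pop_sex / rowSums(fit), "*")   # match sex totals
  fit <- sweep(fit, 2, pop_race / colSums(fit), "*")  # match race totals
  if (max(abs(rowSums(fit) - pop_sex)) < 1e-8) break
}

# Weight-class weight: number of people each sampled person
# in a cell stands for.
r <- fit / samp
round(r, 1)
```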
What do I do with the records that have NA in them for everything but the region?
You just drop these records from your sample. They are your non-respondents. Your total "working sample" size will be less than 1,000.
When I rake, do I need to worry about the way that I allocated the sample sizes to the strata?
Only if your allocation was not proportional. If you have allocated your sample proportional to the population totals for the strata, then you need not worry. Otherwise, the counts in your table should take the allocation into consideration.
When I impute the education level for those with missing values, do I need to use the weighting scheme in selecting the records at random in the hot deck procedure?
No, you may simply choose at random with replacement from the group of persons with sex and race that match the record with the missing education. (Note that race is defined according to the variable you created in 2ii, i.e. White non-Hispanic, Black non-Hispanic, Hispanic, and Other.)
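A minimal sketch of an unweighted hot deck, assuming every sex-by-race cell contains at least one person with a reported education level. The data frame and the education codes are made up, and the column names are hypothetical.

```r
set.seed(1)  # only so that the sketch is reproducible

# Toy data; educ values are invented codes, NA marks a missing value.
d <- data.frame(
  sex   = c("M", "M", "M", "F", "F", "F"),
  race4 = c("Other", "Other", "Other", "Hispanic", "Hispanic", "Hispanic"),
  educ  = c(12, NA, 16, 14, 14, NA),
  stringsAsFactors = FALSE
)

observed <- d$educ                # donors come from reported values only
cell <- paste(d$sex, d$race4)     # matching cell: sex by race group

for (i in which(is.na(observed))) {
  donors <- observed[cell == cell[i] & !is.na(observed)]
  # Draw an index rather than calling sample(donors, 1), because
  # sample() treats a single number x as the range 1:x.
  d$educ[i] <- donors[sample(length(donors), 1)]
}
d
```

Each missing value gets its own independent draw, so the same donor may be used more than once, which is what sampling with replacement means here.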
What should my estimator for the proportion of college educated adults look like?
- Start by creating a new variable that indicates whether a person is college educated or not, i.e. y_i is 1 if the person has at least a college education, and 0 otherwise. Also create an indicator for whether the person is an adult or not, i.e. x_i is 1 if the person is 18 or over and 0 otherwise.
- Use the weights from 2ii in creating your estimator. Call these weights r_i, for raking weights. These are the only weights that you will need, because your estimator is at the person level, not the family level.
- Find the proportion of adults in each stratum that are college educated. Be sure to use your weights. This means that your estimator should be
  (sum y_i r_i)/(sum x_i r_i), where the sum is over those individuals sampled from the first (or second) stratum.
- Use the population totals to combine your two proportions into a single estimate.
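The steps above can be sketched as follows. The data and the stratum totals are invented for illustration, and weighting the two proportions by the stratum population totals is one reading of "combine"; the names p_h, N_h, and p_hat are hypothetical.

```r
# Toy sample: stratum, college indicator y, adult indicator x, raking weight r.
stratum <- c(1, 1, 1, 1, 2, 2, 2, 2)
y <- c(1, 0, 0, 1, 0, 0, 1, 0)
x <- c(1, 1, 0, 1, 1, 1, 1, 0)
r <- c(50, 60, 55, 45, 40, 42, 38, 41)

# Weighted proportion within each stratum: sum(y * r) / sum(x * r).
p_h <- tapply(y * r, stratum, sum) / tapply(x * r, stratum, sum)

# Hypothetical stratum population totals (NOT the Fort Evans totals).
N_h <- c(30000, 20000)

# Combine the stratum proportions, weighting each by its population share.
p_hat <- sum(N_h * as.numeric(p_h)) / sum(N_h)
p_hat
```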
What weights do I need to use in finding the family income?
Here, you will need to take the product of the family weight (call it f_i) and the raking weight:
w_i = f_i * r_i. Also, if you have not proportionally allocated, then you will need to adjust for the allocation method (it's probably a good idea to proportionally allocate your sample). You should probably check to see if the sum of r_i over stratum 1 is reasonably close to the population total for stratum 1. If not, you may want to include a third weight to adjust for this difference.
Do I compute the median separately for each stratum and then combine them as I did with the estimate for proportion of college educated adults?
Unfortunately that won't work. Instead, you need to find that income I* such that
Sum w_i for those with incomes less than or equal to I* is
1/2 Sum w_i for all those in the sample.
How do I compute the median using weights?
Try the following -- sort your weights according to salary. Then do the following:
htot <- sum(wt) * 0.5
indexmedian <- min((1:length(wt))[cumsum(wt) > htot])
The median is then the salary at position indexmedian in the sorted list.
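For a self-contained version including the sort, here is a small sketch. The function name weighted_median and the toy incomes and weights are made up for illustration; it keeps the same strict ">" comparison as the lines above.

```r
# Weighted median: the first income (in sorted order) at which the
# cumulative weight exceeds half of the total weight.
weighted_median <- function(income, wt) {
  ord <- order(income)          # sort incomes, carrying the weights along
  income <- income[ord]
  wt <- wt[ord]
  htot <- sum(wt) * 0.5
  indexmedian <- min((1:length(wt))[cumsum(wt) > htot])
  income[indexmedian]
}

# Toy data: the income of 10 carries twice the weight of the others.
income <- c(30, 10, 50, 20)
wt     <- c(1, 2, 1, 1)
med <- weighted_median(income, wt)
med
```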