Project 1

The data for this project come from the Social Security Administration. They describe the top 1000 boy names and the top 1000 girl names used in each of the years from 1880 to 2007.

Import the data file babynames.dat from the project1 directory by using the R command
babyDataURL <- url('http://www.stat.purdue.edu/~mdw/statcur/project1/babynames.dat', 'r')
Store the data in a data.frame using the R command
babyData <- read.table(babyDataURL)
Close the connection to the file using the R command
close(babyDataURL)

Create some matrices by performing the following commands:

boycounts <- matrix(babyData[[4]], nrow=1000, ncol=128, dimnames = list(c(1:1000), c(1880:2007)))
boynames <- matrix(babyData[[3]], nrow=1000, ncol=128, dimnames = list(c(1:1000), c(1880:2007)))

girlcounts <- matrix(babyData[[6]], nrow=1000, ncol=128, dimnames = list(c(1:1000), c(1880:2007)))
girlnames <- matrix(babyData[[5]], nrow=1000, ncol=128, dimnames = list(c(1:1000), c(1880:2007)))
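The calls above rely on the fact that matrix() fills its result column by column. Here is a minimal sketch with made-up values. The vector toyColumn is hypothetical and is not part of babynames.dat. The sketch assumes that the file lists all ranks for the first year, then all ranks for the next year, and so on:

```r
# Hypothetical stand-in for one column of babyData: 3 ranks for each of 2 years,
# ordered by year first and by rank within each year (an assumption about the file).
toyColumn <- c("John", "William", "James",    # first year, ranks 1..3
               "John", "William", "Charles")  # second year, ranks 1..3

# matrix() fills column by column, so each year becomes one column
# and each rank becomes one row.
toyNames <- matrix(toyColumn, nrow = 3, ncol = 2,
                   dimnames = list(1:3, 1880:1881))

toyNames[3, "1881"]   # the rank-3 name in the column labeled "1881"
```

If the file were ordered differently, the byrow argument of matrix() would be the place to adjust.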

After performing these commands, the variable boycounts will be a matrix with 1000 rows and 128 columns. The rows correspond to the ranks 1 through 1000, and the columns correspond to the years 1880 through 2007. Each entry of boycounts is the number of boys who were given the name of that rank in that year, and the matrix boynames contains the corresponding names. The matrices girlcounts and girlnames are organized in the same way. The following examples should illustrate the data:
# The following shows us that Michael was the most popular name in 1995; in fact, there were 41385 boys named Michael who were born that year.
boynames[1,"1995"]
boycounts[1,"1995"]

# The 20 most popular boy names,
# during the years 1880 through 1883, are:
boynames[1:20,1:4]
# This can also be written as:
boynames[1:20,c("1880","1881","1882","1883")]
# This is also possible by typing
boynames[1:20,as.character(1880:1883)]

# Notice that the following fails.
#(Do a mental check, to make sure that you understand why it fails.)
boynames[1:20,1880:1883]

# The 10 most popular boy names,
# during the years 2000 through 2007, are:
boynames[1:10,121:128]

# The 15 most popular girl names,
# during the years 1989 through 1993, are:
girlnames[1:15, c("1989","1990","1991","1992","1993")]
# The numbers of girls with these names are:
girlcounts[1:15, c("1989","1990","1991","1992","1993")]

# The above statements could be written more succinctly, as
girlnames[1:15, as.character(1989:1993)]
girlcounts[1:15, as.character(1989:1993)]

# All of the girl names from 1976 are available by writing:
girlnames[ , "1976"]


Here are the questions for the assignment:

1a. For each decade, find the top 25 (i.e., 25 most popular) names for boys. Do not write one line of code per decade, i.e., do not use 13 lines of code that are practically repeated, one per decade. Do something more efficient.

1b. For each decade, find the top 25 names for girls. It might help to write your solution to 1a in such a way that a simple change can apply to girls instead of boys. Please consider how to do this as efficiently as possible.
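One possible starting point for grouping years into decades, shown only as a sketch on made-up numbers (toyCounts is hypothetical and much smaller than the real matrices), is to compute a decade label for every column name:

```r
# Toy counts matrix (made-up numbers): 2 ranks by 4 years spanning two decades.
toyCounts <- matrix(c(10, 5, 12, 6, 9, 4, 11, 3), nrow = 2,
                    dimnames = list(1:2, c(1888, 1889, 1890, 1891)))

# Integer division by 10 maps every year to the first year of its decade.
years  <- as.integer(colnames(toyCounts))
decade <- 10 * (years %/% 10)

# split() groups the column labels by decade; the same grouping works for
# any matrix that uses years as column names.
yearsByDecade <- split(colnames(toyCounts), decade)
yearsByDecade
```

Wrapping the per-decade work in a function that takes a names matrix and a counts matrix as arguments is one way to reuse the same code for girls and boys.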

1c (optional). For each decade, find the top 25 names overall. Specify, for each name, whether it is used as a boy or girl name.

2a. In 1997, consider the top ten boy names (Michael, Jacob, Matthew, Christopher, Joshua, Nicholas, Brandon, Andrew, Austin, Tyler) and the top ten girl names (Emily, Jessica, Ashley, Sarah, Hannah, Samantha, Taylor, Alexis, Elizabeth, Madison). Were there more boys within the top ten names, or were there more girls?

2b. Extend your answer from part 2a. Indicate, for each year, whether more boys were found among the top ten boy names, or more girls were found among the top ten girl names. Make a vector of the years in which there were more top ten boys, and another vector of the years in which there were more top ten girls. (So your vectors should have 128 years altogether, i.e., 128 results.)

3. Pick a particular decade (for example, 1970-1979), and a particular name (for example, Mary). Find the number of children born with that name during each of the 10 years in that decade. Display the results on a barplot.
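As a rough illustration of the plotting step only, the following sketch draws a barplot from a named vector. The counts in toyYearly are made up and are not real Social Security Administration figures:

```r
# Made-up yearly counts for one name over one decade (not real data).
toyYearly <- c(120, 115, 110, 98, 90, 87, 80, 76, 70, 65)
names(toyYearly) <- 1970:1979

# barplot() draws one bar per element and labels the bars with names(toyYearly).
barplot(toyYearly, xlab = "Year", ylab = "Number of children",
        main = "Toy counts for one name, 1970-1979")
```

Keep in mind that a name can sit at a different rank in each year, and it may be missing from the top 1000 in some years.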

4. Pick a name with at least 3 spellings (e.g., Christy, Christie, Christi) and a decade. Track the ranks of these spellings during that decade. Compare the year versus the ranks of the three spellings using dotchart().

5a. In which year are the most children represented in the survey? (Combine the boy and girl names together; do not make a distinction by gender.)

5b. Extend question 5a by ranking all of the years according to the number of children represented in the survey, so that your answer from 5a is the first element of your result. Find the five largest years by this measure, i.e., the 5 years in which the most children are represented in the survey.

6a. Pick a year. Make a list of all 1000 ranked girl names from that year, sorted in alphabetic order.

6b. Make an analogous list of all 1000 ranked boy names from a different year, sorted in alphabetic order.

7a. Pick a year. Consider all 1000 ranked boy names from that year. How many names begin with the letter "A"? "B"? "C"? etc.? For the purposes of this problem, the "rank of a letter" is the number of names from that year which begin with that letter. Put the 26 letters of the alphabet in order according to their letter "rank". Also include the number of names beginning with each letter.
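The counting of first letters can be sketched with substr(), table(), and sort(). This is only one possible approach, and the names in toyNames are made up as a stand-in for one column of a names matrix:

```r
# Made-up names standing in for one column of a names matrix.
toyNames <- c("Anna", "Alice", "Bertha", "Clara", "Cora", "Carrie")

# substr(x, 1, 1) extracts the first letter; table() counts each letter;
# sort(..., decreasing = TRUE) puts the most common letter first.
letterCounts <- sort(table(substr(toyNames, 1, 1)), decreasing = TRUE)
letterCounts
```

Letters that start no names do not appear in the table. Converting the first letters with factor(..., levels = LETTERS) before calling table() is one way to keep all 26 letters, with zero counts where needed.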

7b. Pick a different year. This time, make the list using girl names.

8a. Create a list describing the overall 100 most popular boy names by accumulating the data for the entire 128 year period.

8b. Make an analogous list for the overall 100 most popular girl names.

9a. Find all of the girl names that occurred in all 128 of the years. (In other words, only include a name if it was found in all 128 years.)

9b. Repeat the problem, but this time for boy names.

10a. Make a list of ALL boy names with the following property: There is at least one year such that the boy name was used 10,000 or more times. For instance, the list should include Nicholas since this name occurred over 10,000 times in 1999, and the list should also include Edward since this name occurred over 10,000 times in 1922.

10b. Make an analogous list for girl names.

11a. Find every name with the property that it was used as a boy name and a girl name in the same year. For instance, Shannon was used as a name for both genders in 1938. For each such name and each such year, determine which gender the name was given to more often.

11b. Can the predominant gender of a name change over time? (In other words, is there any name, used for both genders, which is given to more boys in one year and to more girls in another year?)

12a. Let a "jump" in rank denote the increase in a name's rank (i.e., the name becomes more popular) from one year to the next year. For instance, the boy name Bruce was ranked 486 in 2003 and 478 in 2004, for a jump of +8 in rank. Find the largest 10 jumps among boy names.

12b. Find the largest 10 jumps among girl names.

13a. A "drop" in rank denotes a decrease in a name's rank (i.e., the name becomes less popular) from one year to the next year. For instance, the girl name Madison decreased from rank 3 in 2006 to rank 5 in 2007, for a drop of 2 in rank. Find the largest 10 drops among girl names.

13b. Find the largest 10 drops among boy names.
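Since row i of a names matrix holds the rank-i name, the change in rank between two consecutive years can be sketched with match(). The two vectors below are made up and stand in for two adjacent columns of a names matrix. This is an illustration under those assumptions, not a complete solution:

```r
# Made-up name columns for two consecutive years; row i holds the rank-i name.
namesYear1 <- c("Emma", "Olivia", "Ava", "Mia")
namesYear2 <- c("Olivia", "Emma", "Mia", "Zoe")

# match() returns the position (that is, the rank) of each Year-1 name in
# Year 2, or NA if the name left the ranked list.
rankYear1 <- seq_along(namesYear1)
rankYear2 <- match(namesYear1, namesYear2)

# Positive values are jumps (more popular), negative values are drops.
rankChange <- rankYear1 - rankYear2
names(rankChange) <- namesYear1
rankChange
```

In this sketch a drop of 2 in rank would show up as -2. Names that appear in only one of the two years (Ava and Zoe here) need a separate decision, because their rank in the other year is unknown.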

14., 15., 16. Create 3 new questions of your own, and answer each question.
Do not (for instance) ask question 14 about boys and then just make question 15 be the analogous question about girls. Each of the three questions should be essentially different.