The BaseballDatabank database on winnie contains 21 relations/tables. These provide information about major league baseball teams and players for the years 1884 through 2003. The database was created and is licensed by Sean Lahman. We are entitled to use it for research, but we cannot distribute it further. We are grateful for his efforts in compiling and managing this database.

The attributes in the different relations are described in Baseball Archive documentation.

Other sites for baseball data include

As with the TCP connections database, you are to do some exploratory data analysis on this data. Make up the questions that most interest you and that you hope will lead to interesting exploration of the data, of accessing the relational database, and of composing and creating displays. The goal is to illustrate aspects of using a relational database, understanding how and when to perform some commands in the database and others in R, and deciding what and how to plot to display attributes of the data. You should explore one or more questions that allow you to illustrate each of these skills.
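
To make the "database or R" choice concrete, here is a minimal sketch that computes team payrolls per year in two ways: by aggregating in R after fetching the rows, and by pushing the aggregation into a single SQL statement. The table name `Salaries`, its column names, and all of the values are made up for illustration and are not taken from the database on winnie. The SQL part runs against a throwaway in-memory SQLite database, and only if the DBI and RSQLite packages happen to be installed.

```r
# Toy stand-in for a salaries table. Column names and values are assumptions.
salaries <- data.frame(
  yearID = c(1999, 1999, 1999, 2000, 2000, 2000),
  teamID = c("AAA", "AAA", "BBB", "AAA", "BBB", "BBB"),
  salary = c(100, 200, 150, 300, 50, 75)
)

# Approach 1: fetch every row, then aggregate in R.
payrollR <- aggregate(salary ~ yearID + teamID, data = salaries, FUN = sum)
names(payrollR)[3] <- "payroll"
print(payrollR)

# Approach 2: let the database aggregate, so only one row per
# (year, team) pair is transferred back to R.
query <- "SELECT yearID, teamID, SUM(salary) AS payroll
          FROM Salaries
          GROUP BY yearID, teamID"

# Run the query only when the optional packages are available.
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  DBI::dbWriteTable(con, "Salaries", salaries)
  payrollSQL <- DBI::dbGetQuery(con, query)
  DBI::dbDisconnect(con)
  print(payrollSQL)
}
```

Both approaches should give the same totals. With a large table, the SQL version moves far less data into R, while the R version is often more convenient when the summary function is not available in SQL.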

Below are some questions you might consider if you are having difficulty coming up with your own. As with the TCP data, you can get inspiration from other sources such as Web sites, newspaper articles, journals, etc. In such cases, your task is to do the necessary computations and explain them.

Visualization is a complex art. Determining how to compose plots and present different attributes requires design; creating the display in software is then a technical task. You should explore both aspects, asking questions whenever you want. I have provided some example plots with code to get the data and create the basic plot.
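
As one possible starting point for a basic plot, the sketch below draws one line per team for payroll over years using base graphics, and then redraws a single team with a heavier line to show how an attribute can be superimposed. The data frame, team codes, and payroll values are invented for illustration; real values would come from a query against the database.

```r
# Made-up payroll data: one row per (Year, Team). Values are illustrative only.
payroll <- data.frame(
  Year    = rep(1998:2001, times = 3),
  Team    = rep(c("AAA", "BBB", "CCC"), each = 4),
  Payroll = c(40, 45, 52, 60,  20, 22, 21, 25,  70, 80, 95, 110)
)

# Reshape to a Year x Team matrix so that matplot() draws one line per team.
wide  <- tapply(payroll$Payroll, list(payroll$Year, payroll$Team), sum)
years <- as.numeric(rownames(wide))

matplot(years, wide, type = "b", lty = 1, pch = 1, col = 1:3,
        xlab = "Year", ylab = "Payroll (made-up units)")
legend("topleft", legend = colnames(wide), col = 1:3, lty = 1, pch = 1)

# Superimpose emphasis on one team, e.g. a hypothetical finalist.
lines(years, wide[, "CCC"], lwd = 3, col = 3)
```

With many teams, a single panel of overlapping lines becomes hard to read, which is one reason to contrast it with other plot types.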

You might find the following functions and the related ones found on their help pages useful:
The lattice package may also prove useful for creating plots with different panels for levels of factors, etc.
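
For instance, a minimal lattice sketch with one panel per team might look like the following. The data are invented, and the variable names `Year`, `Team`, and `Payroll` are assumptions rather than column names from the database.

```r
library(lattice)

# Made-up payroll data. Team is a factor so each level gets its own panel.
payroll <- data.frame(
  Year    = rep(1998:2001, times = 3),
  Team    = factor(rep(c("AAA", "BBB", "CCC"), each = 4)),
  Payroll = c(40, 45, 52, 60,  20, 22, 21, 25,  70, 80, 95, 110)
)

# The part of the formula after "|" names the conditioning factor.
p <- xyplot(Payroll ~ Year | Team, data = payroll, type = "b",
            layout = c(3, 1),
            xlab = "Year", ylab = "Payroll (made-up units)")

# Lattice functions return an object; printing it draws the plot.
print(p)
```

Because all panels share the same axes by default, differences in level between teams are easy to compare across panels.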

If you think R might have a function to do something you want to do, use the help.start() facility in R or ask on the class bulletin board.
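
As a small illustration of looking for a function from the R prompt, the sketch below uses `help.search()` and `apropos()`, both of which are standard R utilities. The search term "boxplot" is just an example, and these two functions are my suggestion rather than ones named in the text above.

```r
# Search the installed documentation by keyword (same idea as ??boxplot).
hits <- help.search("boxplot")
print(unique(hits$matches$Topic))

# List objects on the search path whose names match a pattern.
print(apropos("boxplot"))

# Open the browser-based help system, but only in an interactive session.
if (interactive()) help.start()
```
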
  1. How many people are included in the database?
  2. Are all of these players? How many are players? How many are managers? And how many are both?
  3. What is the earliest season recorded in this database? And the most recent?
  4. What college produced the largest number of major league baseball players? How many colleges are there in total?
  5. Can we tell who won the "World" series in a given year?
  6. Who lost the "World" series in each year?
  7. Look at the relationship between the number of games won in a season and winning the "World" series. Similarly, relate these to payroll.
  8. For 1999, compute the payrolls of the different teams. Can we do this for all years in a single SQL statement?
  9. Plot the payrolls over years for the different teams. What plot types are good for showing this data? Contrast different graphical techniques.

    Superimpose the payroll of the two teams that made it to the "World Series" on this plot. Is there a relationship? How about for the teams that made it into the playoffs in a year?
  10. Show the distributions of the payrolls over years. We can think of a boxplot for each year for this. Again, we can superimpose additional attributes and even lines connecting the different statistics for particular teams if they are not very noisy.

    ```r
    boxplot(Payroll ~ Year, data = d)
    # Standardize by dividing by the maximum for each year.
    # Assumption: column 1 of d is the year and column 3 is the payroll,
    # so maxPayroll is a vector of yearly maxima named by year.
    maxPayroll = sapply(split(d[,3], d[,1]), max)
    d[,3] = d[,3]/maxPayroll[as.character(d[,1])]
    boxplot(Payroll ~ Year, data = d)
    ```
  11. Look at the payrolls for the teams that are in the same league, and then in the same division. Are there any interesting characteristics?
  12. Is the payroll related to the age of the players? One might expect an old team to be paying veteran players a lot near the end of their careers. Teams with a large number of older players would therefore have a large payroll. Is there any evidence supporting this?
  13. Look at the distribution of salaries of individual players over time for different teams.
  14. Look at players and see whether the distribution of home runs has shifted upward over the years.
  15. Are Hall of Fame players, in general, inducted because of rare, excellent performances or years or are they rewarded for consistency over years?
  16. Are certain baseball parks better for hitting home runs? Can we tell from this data? Can we make inferences about this question?
  17. Do teams with a few good players and many mediocre players tend to do better than a team made up of more homogeneous talent?
  18. Look at the distribution of how well batters do. Does this vary over the years? Do the same players excel each year? Is there a clustering? A bi-modal distribution?
  19. Do pitchers get better with age? Is there an improvement and then a fall-off in performance? Is this related to how old they are, the number of years they have been pitching, or which league they are in and the designated hitter rule? Do we have information about each of these factors, and how can we combine them to present information about the general question?
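
As an example of how the first two questions might be approached once the identifiers are in R, the sketch below counts people, players, managers, and those who were both, using set operations. The ID vectors are invented; in practice each would come from a `SELECT DISTINCT` query on the relevant table, whose name and columns you would need to look up in the documentation.

```r
# Hypothetical identifier vectors, standing in for the results of two
# SELECT DISTINCT queries (one for players, one for managers).
playerIDs  <- c("p01", "p02", "p03", "p04")
managerIDs <- c("p03", "p04", "m01")

# Set operations answer the "how many are both?" style questions.
everyone     <- union(playerIDs, managerIDs)
both         <- intersect(playerIDs, managerIDs)
onlyPlayers  <- setdiff(playerIDs, managerIDs)
onlyManagers <- setdiff(managerIDs, playerIDs)

counts <- c(people   = length(everyone),
            players  = length(playerIDs),
            managers = length(managerIDs),
            both     = length(both))
print(counts)
```

The same counts could also be computed entirely in SQL with joins. Doing the set operations in R is simply one option when the ID lists are small enough to fetch.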