The BaseballDatabank database on winnie contains 21 relations/tables.
These provide information about major league baseball teams and
players for the years 1884 through 2003. The database was created and
is licensed by Sean Lahman. We are entitled to use it for research
but cannot distribute it further. We are grateful for his efforts in
compiling and managing this database.
The attributes in the different relations are described in
Other sites for baseball data include
As with the TCP connections database, you are to do some exploratory
data analysis on this data. You can make up the questions that most
interest you and that hopefully lead to interesting exploration of the
data, of accessing the relational database, and of composing and creating
displays. The goal is to illustrate aspects of using a relational
database, understanding how and when to perform some commands in a
database and others in R, and what and how to plot to display
attributes of data. You should explore one or more questions that
allow you to illustrate each of these skills.
Below are some questions you might consider if you are
having difficulty coming up with your own. As with the TCP data, you
can get inspiration from other sources such as Web sites, newspaper
articles, journals, etc. In such cases, your task is to do the
necessary computations and explain them.
Visualization is a complex art. Determining how to compose plots
and present different attributes requires design. Then creating the
display in software is technical. You should explore these two
aspects, asking questions when you want. I have provided some example plots
with code to get the data and
create the basic plot.
You might find the following functions and the related ones found on
their help pages useful:
boxplot(), legend(), lines(), points(),
The lattice package may also prove useful for creating plots with
different panels for levels of factors, etc.
If you think R might have a function to do something
you want to do, use the help.search()
and help.start() facilities in R or ask on the
class bulletin board.
How many people are included in the databases?
Are all of these players? How many are players? How many are
managers? And how many are both?
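One way to approach the players/managers question is with set operations on the identifiers. The following is only a sketch: the identifiers are made up and stand in for the results of two queries (one for the distinct people who played, one for the distinct people who managed), and which tables and columns to query must be checked against the attribute descriptions.

```r
# Hypothetical identifiers standing in for the results of two queries:
# the distinct people who played and the distinct people who managed.
# The values are invented for illustration.
players  <- c("p01", "p02", "p03", "p04", "p05")
managers <- c("p04", "p05", "m01")

both         <- intersect(players, managers)  # played and managed
onlyPlayers  <- setdiff(players, managers)    # played, never managed
onlyManagers <- setdiff(managers, players)    # managed, never played
everyone     <- union(players, managers)      # all distinct people

counts <- c(players = length(players), managers = length(managers),
            both = length(both), total = length(everyone))
print(counts)
```

The same counts could instead be computed inside the database with a join, which avoids transferring the full identifier lists to R.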
What is the earliest season recorded in this database? And the most recent?
What college produced the largest number of major league baseball
players? How many colleges are there in total?
Can we tell who won the "World" series in a given year?
Who lost the "World" series in each year?
Look at the relationship between the number of games
won in a season and winning the "World" series.
And similarly relate these to payroll.
For 1999, compute the payrolls of the different teams.
Can we do this for all years in a single SQL statement?
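As a rough illustration of computing payrolls for every year at once, the sketch below shows a grouped query next to the equivalent computation in R. The table name Salaries and the columns yearID, teamID and salary are assumptions about the schema, the query string is not executed here, and the salary records are made up.

```r
# SQL that could be sent to the database. The table name Salaries and
# the columns yearID, teamID and salary are assumptions about the schema.
sql <- "SELECT yearID, teamID, SUM(salary) AS payroll
          FROM Salaries
         GROUP BY yearID, teamID"

# Made-up salary records standing in for the raw rows of that table.
salaries <- data.frame(
  yearID = c(1998, 1998, 1998, 1999, 1999, 1999, 1999),
  teamID = c("AAA", "AAA", "BBB", "AAA", "BBB", "BBB", "BBB"),
  salary = c(100, 250, 300, 400, 150, 150, 200)
)

# The same grouping done in R after fetching every row.
payroll <- aggregate(salary ~ yearID + teamID, data = salaries, FUN = sum)
names(payroll)[3] <- "payroll"
payroll <- payroll[order(payroll$yearID, payroll$teamID), ]
print(payroll)

# Restricting to one year corresponds to adding WHERE yearID = 1999.
payroll1999 <- payroll[payroll$yearID == 1999, ]
```

Doing the sum in the database returns one row per team and year rather than every salary record, which is usually the better division of labor when the raw table is large.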
Plot the payrolls over years for the different teams.
What plot types are good for showing this data?
Contrast different graphical techniques.
Superimpose the payroll of the two teams that made it to the "World
Series" on this plot.
Is there a relationship?
How about for the teams that made it into the
playoffs in a year?
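One possible way to draw payrolls over the years and mark the teams that reached the series is to draw a line per team with lines() and then add points() for the marked team-years. This is a sketch with made-up data: the team codes, payroll values and the data frame of series teams are all hypothetical.

```r
# Made-up payrolls (one row per team and year); the team codes, values
# and the choice of series teams are hypothetical.
d <- data.frame(
  Year    = rep(1998:2001, times = 3),
  Team    = rep(c("AAA", "BBB", "CCC"), each = 4),
  Payroll = c(40, 45, 52, 60,  30, 31, 35, 33,  55, 70, 68, 80)
)
series <- data.frame(Year = 1998:2001,
                     Team = c("CCC", "CCC", "AAA", "CCC"))

teams <- unique(as.character(d$Team))
cols  <- seq_along(teams)

pdf(NULL)  # draw to a null device; remove this line to plot on screen
# Empty plot with axes wide enough for every team, then one line per team.
plot(range(d$Year), range(d$Payroll), type = "n",
     xlab = "Year", ylab = "Payroll")
for (i in seq_along(teams)) {
  sub <- d[d$Team == teams[i], ]
  lines(sub$Year, sub$Payroll, col = cols[i])
}
# Mark the team-years that reached the series with solid points.
hits <- merge(d, series, by = c("Year", "Team"))
points(hits$Year, hits$Payroll, pch = 19)
legend("topleft", legend = teams, col = cols, lty = 1)
invisible(dev.off())
```

With around thirty teams, a single panel of lines becomes crowded, so it is worth comparing this with a panel per league or division.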
Show the distributions of the payrolls over years. We can think of a
boxplot for each year for this. Again, we can superimpose additional
attributes and even lines connecting the different statistics for
particular teams if they are not very noisy.
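# d has one row per team and year, with columns Year and Payroll
boxplot(Payroll ~ Year, data = d)
# Standardize by dividing by the maximum for each year
maxPayroll = tapply(d$Payroll, d$Year, max)  # named by year
d$Payroll = d$Payroll/maxPayroll[as.character(d$Year)]
boxplot(Payroll ~ Year, data = d)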
Look at the payrolls for the teams
that are in the same league, and then in the same division.
Are there any interesting characteristics?
Is the payroll related to the age of the players? One might expect an
old team to be paying veteran players a lot near the end of their
careers. Teams with a large number of older players would therefore
have a large payroll. Is there any evidence supporting this?
Look at the distribution of salaries of individual
players over time for different teams.
Look at players and see whether the distribution
of home runs has shifted upward over the years.
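To see whether the home-run distribution has moved over the years, it can help to compare several quantiles per year rather than a single average, since the middle and the upper tail may behave differently. The sketch below uses invented season totals; the column names Year and HR are hypothetical.

```r
# Made-up season totals: one row per player-season. Values are invented.
hr <- data.frame(
  Year = rep(c(1980, 1990, 2000), each = 5),
  HR   = c(2, 5, 8, 12, 30,   3, 6, 10, 15, 35,   4, 9, 14, 22, 50)
)

# Median, upper quartile and maximum of home runs for each year.
# The result is a matrix with one row per quantile and one column per year.
qs <- sapply(split(hr$HR, hr$Year), quantile, probs = c(0.5, 0.75, 1))
print(qs)
```

Plotting each row of this matrix against the year with lines() gives a compact view of how the distribution changes.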
Are Hall of Fame players, in general, inducted because of rare,
excellent performances or years, or are they rewarded for consistency?
Are certain baseball parks better for hitting Home Runs?
Can we tell from this data? Can we make inferences about this?
Do teams with a few good players and many mediocre players tend to do better
than a team made up of more homogeneous talent?
Look at the distribution of how well batters do.
Does this vary over the years?
Do the same players excel each year?
Is there a clustering? a bi-modal distribution?
Do pitchers get better with age? Is there an improvement and then a
fall off in performance? Is this related to how old they are,
the number of years they have been pitching, or which league they are in
and the designated hitter rule? Do we have information about each of
these factors and how can we combine them to present information about
the general question?