Spreadsheets and Databases

1  Spreadsheets

Spreadsheets, in particular the "xls" format used by Microsoft Excel, are one of the most common (if not the most common) methods for transferring data from one organization to another. When you download or receive a spreadsheet, it's often a good idea to make sure that it really is a spreadsheet. Comma-separated files, which are easily read into spreadsheet programs, are often erroneously labeled "spreadsheets" or "Excel files", and may even have an incorrect extension of ".xls". In addition, Excel spreadsheets are often used to hold largely textual information, so simply finding a spreadsheet that appears to contain data can be misleading. Furthermore, many systems are configured to use a spreadsheet program to automatically open files with an extension of ".csv", further adding to the confusion.
For many years, the only reliable way to read an Excel spreadsheet was with Microsoft Excel, which is only available on Windows and Mac OS computers; users on Unix were left with no option other than finding a different computer. In the last few years, a number of programs, notably gnumeric and OpenOffice.org (usually available through the ooffice command), have been developed through careful reverse engineering to give Unix users the ability to work with these files. To ensure its advantage in the marketplace, Microsoft doesn't publish a detailed description of exactly how it creates its spreadsheet files; there are often "secret" ways of doing things that are only known to the developers within Microsoft. Reverse engineering means looking at the way a program handles different kinds of files, and then trying to write a program that imitates what the other program does. The end result of all this is that there are several ways to get the data from spreadsheet files into R.
Spreadsheets are organized as a collection of one or more sheets, stored in a single file. Each of the sheets represents a rectangular display of rows and columns, not unlike a data frame. Unfortunately, people often embed text, graphics and pictures into spreadsheets, so it's not a good idea to assume that a spreadsheet has the consistent structure of a data frame or matrix, even if portions of it do appear to have that structure. In addition, spreadsheets often have a large amount of header information (more than just a list of variable names), sometimes making it challenging to figure out the correct variable names to use for the different columns in the spreadsheet.
Our main concern will be in getting the information from the spreadsheet into R: we won't look at how to work with data within Excel.
If you are simply working with a single Excel spreadsheet, the easiest way to get it into R is to open the spreadsheet with any compatible program, go to the File menu, and select Save As. When the file selector dialog appears, notice that you will be given an option of different file format choices, usually through a drop-down menu. Choose something like "Tab delimited text", or "CSV (Comma separated values)". If any of the fields in the spreadsheet contain commas, tab delimited is a better choice, but generally it doesn't matter which one you use. Now, the file is (more or less) ready to be read into R, using read.csv or read.delim. A very common occurrence is for fields stored in a spreadsheet to contain either single or double quotes, in which case the quote= argument can be used to make sure R understands what you want. Spreadsheets often arrange their data so that there are a different number of valid entries on different lines, which will confuse read.table; the fill= argument may be of use in these cases. Finally, as mentioned previously, there are often multiple descriptive lines of text at the top of a spreadsheet, so some experimentation with the skip= argument to read.table may be necessary. Alternatively, you can simply delete those lines from the file into which you save the tab- or comma-delimited data.
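To make this concrete, here's a sketch of reading such a file; the file name, the number of header lines to skip, and the quote characters are hypothetical, and would need to be adjusted for the file at hand:

mydata = read.delim('mydata.txt',        # file saved as tab-delimited text
                    skip=7,              # number of descriptive header lines
                    quote="\"'",         # accept both single and double quotes
                    fill=TRUE,           # pad lines with fewer fields using NA
                    stringsAsFactors=FALSE)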
As you can see, this is a somewhat time-consuming process which needs to be customized for each spreadsheet. It's further complicated by the fact that it only handles one sheet of a multi-sheet spreadsheet; in those cases the process would need to be repeated for each sheet. If you need to read a number of spreadsheets, especially ones where you need to access the data from more than one sheet, you need to use a more programmable solution.
R provides several methods to read spreadsheets without having to save them as tab- or comma-delimited files. The first is the read.xls function in the gdata library. This function uses a Perl program (included as part of the library) that converts the spreadsheet to a comma-separated file, and then uses read.csv to convert it to a data frame. In order for this library to work, perl must also be installed on the computer that's running R. perl will be installed on virtually any Unix-based computer (including Mac OS X), but will most likely not be on most Windows computers (although once perl is installed on a Windows system, the read.xls function will work with no problems). However, there is a Windows-only package, xlsReadWrite, available from CRAN, which can both read and write Excel files. (On other platforms, the dataframes2xls package provides a write.xls function.)
Remember, for just one or two data tables from a spreadsheet, saving as a tab- or comma-delimited file and using the correct read.table variant will usually be the easiest route.
Since read.xls does nothing more than convert the spreadsheet to comma-separated form and then call read.csv, the guidelines for using it are very similar to the normal use of read.csv. The most useful feature of read.xls is its optional sheet= argument, which selects the sheet to be read, making it easy to process each sheet of a multi-sheet file in turn, as in the sketch below.
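For example, here's a sketch of reading the first three sheets of a spreadsheet file into a list of data frames; the file name and the number of sheets are hypothetical:

library(gdata)
# read.xls converts the requested sheet to CSV, then calls read.csv
sheets = lapply(1:3, function(i)
    read.xls('mydata.xls', sheet=i, stringsAsFactors=FALSE))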
As a simple example, consider a spreadsheet with listings for the top 200 educational institutions in the US with respect to the number of postdoctoral students, which I found at http://mup.asu.edu/Top200-III/2_2005_top200_postdoc.xls.
Here's a view of how the data looks in the UNIX gnumeric spreadsheet program:

[Figure: the spreadsheet as displayed in gnumeric]
Here are the first few lines of the file created when I use File->Save As, and choose the CSV format:
"The Top 200 Institutions--Postdoctoral Appointees
(2005)",,,,,
,,,,,
"Top 50 Institutions
in Postdoctoral Appointees
(2005)","Number of Postdocs","National Rank","Control Rank",,"Institutional
Control"
"Harvard University",4384,1,1,,Private
"Johns Hopkins University",1442,2,2,,Private
"Stanford University",1259,3,3,,Private
"University of California - Los Angeles",1094,4,1,,Public
"Yale University",1012,5,4,,Private
"University of California - San Francisco",1003,6,2,,Public
"University of Washington - Seattle",963,7,3,,Public
"University of California - San Diego",886,8,4,,Public
"Massachusetts Institute of Technology",851,9,5,,Private
"University of Pennsylvania",815,10,6,,Private
"Columbia University",793,11,7,,Private
"University of California - Berkeley",774,12,5,,Public

While the data itself looks fine, it's unlikely that the header information will be of much use. Since there are 7 header lines, I could use the following statements to read the data into R:
> fromcsv = read.csv('2_2005_top200_postdoc.csv',header=FALSE,skip=7,stringsAsFactors=FALSE)
> dim(fromcsv)
[1] 206   6
> head(fromcsv)
                                        V1   V2 V3 V4 V5      V6
1                       Harvard University 4384  1  1 NA Private
2                 Johns Hopkins University 1442  2  2 NA Private
3                      Stanford University 1259  3  3 NA Private
4   University of California - Los Angeles 1094  4  1 NA  Public
5                          Yale University 1012  5  4 NA Private
6 University of California - San Francisco 1003  6  2 NA  Public

The fifth column can be removed (take a look at the spreadsheet to see why), and we should supply some names for the columns. One problem with this spreadsheet is that it repeats its header information several times partway through the spreadsheet. These lines will have to be removed manually. They can be identified by the fact that the sixth variable should be either Public or Private, and could be eliminated as follows:
> fromcsv = fromcsv[fromcsv$V6 %in% c('Public','Private'),]

The fifth variable can be removed by setting it to NULL, and the columns can be named.
> fromcsv$V5 = NULL
> names(fromcsv) = c('Institution','NPostdocs','Rank','ControlRank','Control')

Finally, a check of the columns shows that, because of the header problem, the numeric variables need to be converted explicitly:
> sapply(fromcsv,class)
Institution   NPostdocs        Rank ControlRank     Control
"character" "character" "character" "character" "character"
> fromcsv[,c(2,3,4)] = sapply(fromcsv[,c(2,3,4)],as.numeric)

A similar procedure could be carried out with read.xls, although it might take some experimentation with skip= to make things work properly. The only other difference between read.xls and read.csv for this example is that a space mysteriously appeared at the end of one of the variables when using read.xls:
library(gdata)
# skip the first two header lines; the remaining rows are read without a header
fromreadxls = read.xls('2_2005_top200_postdoc.xls',stringsAsFactors=FALSE,header=FALSE,skip=2)
# drop one leftover header row, along with two extraneous columns
fromreadxls = fromreadxls[-1,-c(1,6)]
# remove the mysterious trailing space from the last column
fromreadxls$V7 = sub(' $','',fromreadxls$V7)
# keep only the data rows, eliminating the repeated header lines
fromreadxls = fromreadxls[fromreadxls$V7 %in% c('Public','Private'),]
# convert the numeric columns, then supply names for the variables
fromreadxls[,c(2,3,4)] = sapply(fromreadxls[,c(2,3,4)],as.numeric)
names(fromreadxls) = c('Institution','NPostdocs','Rank','ControlRank','Control')

2  Databases

A database is a collection of data, usually with some information (sometimes called metadata) about how the data is organized. But many times when people refer to a database, they mean a database server, which is similar to a web server, but responds to requests for data instead of web pages.
By far the most common type of database server is known as a relational database management system (RDBMS), and the most common way of communicating with such a database is a language known as SQL, an acronym for Structured Query Language. Some examples of database systems that use SQL to communicate with an RDBMS include Oracle, Sybase, Microsoft SQL Server, SQLite, MySQL and Postgres. While there is an SQL standard, each database manufacturer provides some additional features, so SQL that works with one database is not guaranteed to work on another. We'll try to stick to aspects of SQL that should be available on most SQL-based systems.
A database consists of one or more tables, which are generally stored as files on the computer that's running the DBMS. A table is a rectangular array, where each row represents an observation, and each column represents a variable. Most databases consist of several tables, each containing information about one aspect of the data.
For example, consider a database to hold information about the parts needed to build a variety of different products. One way to store the information would be to have a single data table containing the part id, its description, the supplier's name, and the cost of the part. An immediate problem with this scheme concerns how we could store information about the relation between products and parts. Furthermore, if one supplier provided many different parts, redundant information about suppliers would be repeated many times in the data table. In the 1970s, when database technology was first being developed, disk and memory space were limited and expensive, and organizing data in this way was not very efficient, especially as the size of the database increased. In the late 1970s, the idea of a relational database was developed at IBM, with the first commercial offering coming from a company which is now known as Oracle.
Relational database design is governed by a principle known as normalization. While entire books are devoted to the subject, the basic idea of normalization is to remove redundancy as much as possible when creating the tables that make up a database. Continuing the parts example, a properly normalized database to hold the parts information would consist of four tables. The first would contain a part id to uniquely identify the part, its description, price and an id to identify the supplier. A second table would contain the supplier codes and any other information about the supplier. A third table would contain product ids and descriptions, while the final table would have one record for each part used in each product, stored as pairs of product id and part id.
The variables that link together the different tables are referred to as keys, or sometimes foreign keys. Clearly, making sure that there are keys to link information from one table to another is critical to the idea of a normalized database. This allows large amounts of information to be stored in manageably sized tables which can be modified and updated without having to change all of the information in all of the tables. Such tables are efficient in the amount of disk and memory resources that they need, which was critical at the time such databases were developed. Also critical to this scheme is a fast and efficient way of joining together these tables, so that queries like "Which suppliers are used for product xyz?", "What parts from Acme Machine Works cost between $2 and $4?" or "What is the total cost of the parts needed to make product xyz?" can be answered. In fact, for many years the only programs that were capable of combining data from multiple sources were RDBMSs.
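To make the normalization idea concrete, here's a sketch of how the four parts tables might be built and queried from R using the RSQLite package; all of the table and column names are invented for illustration:

library(RSQLite)
con = dbConnect(SQLite(), ':memory:')    # a temporary in-memory database
dbExecute(con, 'CREATE TABLE part (partid INTEGER, description TEXT, price REAL, supplierid INTEGER)')
dbExecute(con, 'CREATE TABLE supplier (supplierid INTEGER, name TEXT)')
dbExecute(con, 'CREATE TABLE product (productid INTEGER, description TEXT)')
dbExecute(con, 'CREATE TABLE productpart (productid INTEGER, partid INTEGER)')
# after inserting data, a join answers questions like the total cost of
# the parts needed to make product number 1
dbGetQuery(con, 'SELECT SUM(part.price) FROM part, productpart
                  WHERE part.partid = productpart.partid
                    AND productpart.productid = 1')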
We're going to look at the way the SQL language is used to extract information from RDBMSs, specifically the open source SQLite database (http://sqlite.org), which doesn't require a database server, and the open source MySQL database (http://www.mysql.com), which does. In addition, we'll see how to do typical database operations in R, and how to access a database using SQL from inside of R.
One of the first questions that arises when thinking about databases is "When should I use a database?" There are several cases where a database makes sense:
  1. The only source for the data you need may be an existing database.
  2. The data you're working with changes on a regular basis, especially if it can potentially be changed by many people. Database servers (like most other servers) provide concurrency, which takes care of problems that might otherwise arise if more than one person tries to access or modify a database at the same time.
  3. Your data is simply too big to fit into the computer's memory, but if you could conveniently extract just the parts you need, you'd be able to work with the data. For example, you may have hundreds of variables, but only need to work with a small number at a time, or you have tens of thousands of observations, but only need to work with a few thousand at a time.
One of the most important uses of databases in the last ten years has been as back ends to dynamic web sites. If you've ever placed an order online, joined an online discussion community, or gotten a username/password combination to access web resources, you've most likely interacted with a database. In addition, both the Firefox and Safari browsers use SQLite databases to store cookies, certificates, and other information.
There are actually three parts to an RDBMS: data definition, data access, and privilege management. We're only going to look at the data access aspect of databases; we'll assume that a database is already available, and that the administrator of the database has provided you with access to the resources you need. You can communicate with a database in at least three different ways:
  1. You can use a command line program that will display its results in a way similar to the way R works: you type in a query, and the results are displayed.
  2. You can use a graphical interface (GUI) that will display the results and give you the option to export them in different forms, such as a comma-separated data file. One advantage to these clients is that they usually provide a display of the available databases, along with the names of the tables in the database and the names of the columns within the tables.
  3. You can use a library within R to query the database and return the results of the query in an R data frame. For use with MySQL, the RMySQL library is available for download from CRAN; for SQLite, the RSQLite package is available. When you install either, the DBI library, which provides a consistent interface to different databases, will also be installed. A minimal sketch of this approach appears after this list.
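Here's that sketch for an SQLite database; the database file name and table name are hypothetical:

library(RSQLite)                        # loads the DBI library as well
con = dbConnect(SQLite(), 'mydata.db')  # open a connection to the database file
result = dbGetQuery(con, 'SELECT * FROM mytable;')   # result is a data frame
dbDisconnect(con)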
It's very important to understand that SQL is not a programming language - it's said to be a declarative language. Instead of solving problems by writing a series of instructions or putting together programming statements into functions, with SQL you write a single statement (query) that will return some or all of the records from a database. SQL was designed to be easy to use for non-programmers, allowing them to present queries to the database in a language that is more similar to a spoken language than a programming language. The main tool for data access is the select statement. Since everything needs to be done with this single statement, descriptions of its syntax can be quite daunting. For example, here's a template that shows some of the important features of the select statement:
SELECT columns or computations
       FROM table
       WHERE condition
       GROUP BY columns
       HAVING condition
       ORDER BY column [ASC | DESC]
       LIMIT offset,count;

(The keywords in the above statement can be typed in upper or lower case, but I'll generally use upper case to remind us that they are keywords.) MySQL is case-sensitive with respect to table and column names, but not commands; SQLite is not case-sensitive at all. SQL statements always end with a semi-colon, but graphical clients and interfaces in other languages sometimes don't require the semi-colon. Fortunately, you rarely need to use all of these clauses in a single query. In fact, you may be able to get by knowing just the command that returns all of the information from a table in a database:
       SELECT * from table;
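To show how the optional clauses fit together, here's a hypothetical query, issued through dbGetQuery from R and assuming an open connection con; the table and column names are invented. For each supplier with more than five parts costing over $2, it reports how many such parts the supplier offers and their average price, listing the ten highest average prices first:

dbGetQuery(con, "SELECT supplier, COUNT(*) AS nparts, AVG(price) AS avgprice
                   FROM parts
                  WHERE price > 2
                  GROUP BY supplier
                 HAVING COUNT(*) > 5
                  ORDER BY avgprice DESC
                  LIMIT 10;")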

3  Working with Databases

When you're not using a graphical interface, it's easy to get lost in a database, considering that each database can hold multiple tables, and each table can have multiple columns. Here are some commands that are useful in finding your way around a database:
MySQL                       SQLite
SHOW DATABASES;             .databases
SHOW TABLES IN database;    .tables
SHOW COLUMNS IN table;      pragma table_info(table)
DESCRIBE table;             .schema table
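From within R, the DBI library provides similar information about an open connection; the connection object and table name here are hypothetical:

dbListTables(con)              # like SHOW TABLES or .tables
dbListFields(con, 'mytable')   # like SHOW COLUMNS or pragma table_info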
For use as an example, consider a database to keep track of a music collection, stored as an SQLite database in the file albums.db. (You can download the database from http://www.stat.berkeley.edu/~spector/s133/albums.db) We can use the commands just shown to see how the database is organized. Here, I'm using the UNIX command-line client called sqlite3:
springer.spector$ sqlite3 albums.db 
SQLite version 3.4.2
Enter ".help" for instructions
sqlite> .mode column
sqlite> .header on
sqlite> .tables
Album   Artist  Track 
sqlite> .schema Album
CREATE TABLE Album 
( alid INTEGER,
	aid REAL,
	title TEXT 
);
CREATE INDEX jj on Album(alid);
sqlite> .schema Artist
CREATE TABLE Artist 
( aid REAL,
	name TEXT 
);
CREATE INDEX ii on Artist(aid);
sqlite> .schema Track
CREATE TABLE Track 
( alid INTEGER,
	aid REAL,
	title TEXT,
	filesize INTEGER,
	bitrate REAL,
	length INTEGER 
);

Studying these tables, we can see that the organization is as follows: each track in the Track table is identified by an artist id (aid) and an album id (alid). To get the actual names of the artists, we can link to the Artist table through the aid key; to get the names of the albums, we can link to the Album table through the alid key. These keys are integers, which don't take up very much space; the actual text of the album title or artist's name is stored only once, in the table devoted to that purpose. (This is the basic principle of normalization that drives database design.) The first thing to notice in schemes like this is that the individual tables won't allow us to really see anything interesting, since they just contain keys, not actual values; to get interesting information, we need to join together multiple tables. For example, let's say we want a list of album titles and artist names. We want to extract the name column from the Artist table and combine it with the title column from the Album table, based on a matching value of the aid variable:
sqlite> .width 25 50
sqlite> SELECT Artist.name, Album.title FROM Artist, Album WHERE Artist.aid = Album.aid;
name                       title                                             
-------------------------  --------------------------------------------------
Jimmy McGriff              100% Pure Funk                                    
Jimmy McGriff              A Bag Full Of Soul                                
Jimmy McGriff              Cherry                                            
Jimmy McGriff              Countdown                                         
Jimmy McGriff              Electric Funk                                     
Jimmy McGriff              The Starting Five                                 
Tiny Parham                1926-1929                                         
Tiny Parham                1929-1940                                         
Jimmie Noone               1928-1929                                         
Ethel Waters, Benny Goodm  1929-1939                                         
Cab Calloway               1930-1931                                         
Cab Calloway               Best Of Big Bands                                 
                         . . . 

Queries like this are surprisingly fast - they are exactly what the database has been designed and optimized to achieve.
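The same join can be run from inside R with the RSQLite package, returning the results as a data frame; this sketch assumes albums.db is in the current directory:

library(RSQLite)
con = dbConnect(SQLite(), 'albums.db')
albums = dbGetQuery(con, 'SELECT Artist.name, Album.title
                            FROM Artist, Album
                           WHERE Artist.aid = Album.aid;')
head(albums)
dbDisconnect(con)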

