Using MySQL with R

1  Accessing databases in R

To provide consistent access to different databases, R uses an intermediate layer of communication known as the DBI (Data Base Interface) library. As we've seen before, different commands are necessary to get information about different databases, so the DBI layer provides a set of routines that will work identically for all databases. Some of these routines are described in the following table.
dbConnect       connect to a database
dbDisconnect    close the connection to a database
dbExistsTable   returns TRUE or FALSE depending on whether a table exists
dbGetQuery      send a query to the database and get the results
dbListTables    shows all the tables in a database
dbListFields    shows the names of the columns in a database table
dbSendQuery     send a query to the database and use fetch to get the results
dbWriteTable    stores a data frame in a database
To let the DBI layer know what type of database you are using, you pass a string (like 'MySQL' or 'SQLite') to the dbDriver function. This will return an object which can be passed to dbConnect to make a connection to the database. dbConnect returns a connection object which is passed to all the other routines so that they can communicate with the database you are using.
For example, to connect to the albums SQLite database, we'd use the following statements:
drv = dbDriver('SQLite')
con = dbConnect(drv,db='albums.db')

Since SQLite doesn't require a database server, you can easily create SQLite databases using dbWriteTable. For very large data sets, you can process the data in chunks, using the append= option of dbWriteTable.
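A minimal sketch of that idea, assuming a comma-separated file called tracks.csv whose columns match the table being created (the file name and chunk size here are hypothetical):

```r
library(RSQLite)
drv = dbDriver('SQLite')
con = dbConnect(drv, db = 'albums.db')
# open the file as a connection so successive reads pick up where we left off
infile = file('tracks.csv', 'r')
hdr = strsplit(readLines(infile, n = 1), ',')[[1]]   # column names
repeat {
    chunk = try(read.csv(infile, header = FALSE, nrows = 10000,
                         col.names = hdr), silent = TRUE)
    if (inherits(chunk, 'try-error')) break      # no lines left to read
    dbWriteTable(con, 'Track', chunk, append = TRUE)
    if (nrow(chunk) < 10000) break               # last (partial) chunk
}
close(infile)
```

Because only one chunk is in memory at a time, this works even when the full file is too large to read into R at once.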
To use MySQL, you must have a valid username and password. For the MySQL database running on springer.berkeley.edu, you'd need to use the following statements:
drv = dbDriver('MySQL')
con = dbConnect(drv,dbname='albums',user='stat133',pass='T0pSecr3t')

If you're using an SCF machine, you'd need to specify a hostname (springer) when connecting; for non-SCF machines, you would set up an SSH tunnel and use a hostname of '127.0.0.1'.
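For example, a connection from a non-SCF machine might look like this (a sketch; the tunnel command and port are assumptions about your particular setup):

```r
# first, in a shell, forward a local port to the MySQL server, e.g.
#   ssh -L 3306:springer:3306 stat133@scf.berkeley.edu
drv = dbDriver('MySQL')
con = dbConnect(drv, dbname = 'albums', user = 'stat133',
                pass = 'T0pSecr3t', host = '127.0.0.1')
```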
Once you get the connection object from dbConnect you can use it repeatedly to make as many queries as you need, and the functions you use are the same, regardless of the nature of the underlying database.
It's up to you to decide how much of the work you want to do through the database, and how much you'll do in R. For example, suppose we want to find albums in the albums database whose total time (sum of the length variable in the Track table) is between 2400 and 2700 seconds, displayed in descending order of total time.
One approach (after loading the library and making the connection to the database) would be to do all the work in the database. Notice that this may be your only option if the database you're dealing with is too large to fit into R.
> albums = dbGetQuery(con,statement='SELECT al.title,ar.name,SUM(tr.length) AS tot\
+                                 FROM Album AS al,Artist AS ar,Track AS tr\
+                                 WHERE tr.alid = al.alid AND tr.aid = ar.aid AND tr.aid = al.aid\
+                                 GROUP BY tr.alid\
+                                 HAVING tot BETWEEN 2400 AND 2700 ORDER BY tot DESC')
> head(albums)
                           title                           name  tot
1                      'Perk Up'                   Shelly Manne 2684
2                   Kaleidoscope                    Sonny Stitt 2679
3 Red Garland's Piano (Remastere                    Red Garland 2676
4                   Ask The Ages                 Sonny Sharrock 2675
5                            Duo   Charlie Hunter & Leon Parker 2667
6                 Tenor Conclave Hank Mobley/Al Cohn/John Coltr 2665

Note the use of a single backslash at the end of each line of input to allow typing the query over multiple lines. Alternatively, the entire query could be passed as one long string. The dbGetQuery function gets the results of the query and puts them into a data frame, in this case called albums.
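When a result set is too large to retrieve all at once, the dbSendQuery/fetch pair mentioned in the table above retrieves the results in pieces. A minimal sketch, assuming a connection object con as above and an arbitrary chunk size of 1000 rows:

```r
# send the query, but don't retrieve any results yet
res = dbSendQuery(con, statement = 'SELECT * FROM Track')
while (!dbHasCompleted(res)) {
    chunk = fetch(res, n = 1000)   # next 1000 rows as a data frame
    # ... process chunk here ...
}
dbClearResult(res)                 # free the result set when done
```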
At the other extreme, you could read the entire database into R and do all the work there. This may not be the wisest approach, because if the database is too large to fit in memory, you may have problems. Working with the data in this way, however, will allow us to see the R equivalent of some of the database operations we've been studying. Once again, I'm assuming that the RMySQL library has been loaded, and a connection object has been obtained through a call to dbConnect.
> album = dbGetQuery(con,statement="select * from Album")
> track = dbGetQuery(con,statement="select * from Track")
> artist = dbGetQuery(con,statement="select * from Artist")

We now have three data frames in R corresponding to the three data tables in the database. In R, the equivalent of an inner join can be carried out with the merge function. You provide merge with the data frames you wish to join, and a character string or vector of character strings with the name(s) of the variables that you wish to join them by. In database terminology, these names are the keys for the join. The merge function will only accept two data frames, but it can be called repeatedly in case there are more than two data frames that need to be merged.
If there are variables with the same name (other than the keys) in the two data frames to be merged, merge will rename them by adding either .x or .y to the end of the name to tell you which data frame that variable came from. An often easier approach is to rename variables before merging so that all the names (other than the keys) will be unique. In the current example, the variable title is in both the album data frame and the track data frame. Let's rename the title variable in the album data frame to albumname:
> names(album)
[1] "alid"  "aid"   "title" "year"
> names(album)[3] = 'albumname'

Now we can merge the album data frame with the artist data frame using aid as a key, and then merge the track data frame with the album data frame using alid as a key.
> album = merge(album,artist,by='aid')
> track = merge(track,album,by='alid')
> head(track)
  alid aid.x                        title.x filesize bitrate length aid.y
1    1   439                 Makin' Whoopee     5870     256    188   439
2    1   439  People Will Say We're In Love     6020     256    193   439
3    1   439 Until The Real Thing Comes Alo     5588     256    179   439
4    1   439    Nice Work If You Can Get It     4948     256    158   439
5    1   439      My Heart Belongs To Daddy     4998     256    160   439
6    1   439           What Is There To Say     6340     256    203   439
                 title.y            name
1 Songs In A Mellow Mood Ella Fitzgerald
2 Songs In A Mellow Mood Ella Fitzgerald
3 Songs In A Mellow Mood Ella Fitzgerald
4 Songs In A Mellow Mood Ella Fitzgerald
5 Songs In A Mellow Mood Ella Fitzgerald
6 Songs In A Mellow Mood Ella Fitzgerald

To calculate the total length of tracks, we only need the album and artist names along with the length of each track:
> track = track[,c('albumname','name','length')]

The rest of the processing is pretty straightforward now that everything is in R:
> tot = aggregate(track$length,list(albumname=track$albumname,artist=track$name),sum)

The previous command would be very slow, because aggregate considers all possible combinations of albumname and artist, most of which never occur in the data. An alternative is to form a single grouping variable and then to break off the artist and album names after aggregation:
> track$albart = paste(track$albumname,track$name,sep='\t')
> tot = aggregate(track$length,track['albart'],sum)
> aa = strsplit(tot$albart,'\t')
> tot$album = sapply(aa,'[',1)
> tot$artist= sapply(aa,'[',2)
> tot$albart = NULL
> use = tot[tot$x > 2400 & tot$x < 2700,]
> use = use[order(use$x,decreasing=TRUE),]
> head(use)
        x              album                         artist
2397 2698          Up & Down                  Horace Parlan
999  2697        In New York     Cannonball Adderley Sextet
1534 2697        Night Dance                  Jimmy Giuffre
2502 2694              Yaina Pucho and His Latin Soul Broth
1367 2693 Mambo Moves Garner                   Errol Garner
1998 2693       Swing Street Hoard Alden Dan Barrett Quinte

A different approach illustrates the relationship between factors and the principles of normalization. Since a factor stores each of its levels only once, normalized tables can often be used to form factors in R. Once again, let's assume that we've loaded the RMySQL library and obtained a connection object. This time, I'll trim off the unneeded variables from the track data frame before processing:
> album = dbGetQuery(con,statement="select * from Album")
> track = dbGetQuery(con,statement="select * from Track")
> artist = dbGetQuery(con,statement="select * from Artist")
> track = track[,c('aid','alid','length')]
> names(track) = c('artist','album','length')
> track$artist = factor(track$artist,levels=artist$aid,labels=artist$name)
> track$album =  factor(track$album,levels=album$alid,labels=album$title)

The tables formed in the process of normalizing the database essentially contain the levels and labels of a factor variable in R. Processing the data set is the same as in the previous example:
> track$albart = paste(track$album,track$artist,sep='\t')
> tot = aggregate(track$length,track['albart'],sum)
> aa = strsplit(tot$albart,'\t')
> tot$album = sapply(aa,'[',1)
> tot$artist = sapply(aa,'[',2)
> tot$albart = NULL
> use = tot[tot$x > 2400 & tot$x < 2700,]
> use = use[order(use$x,decreasing=TRUE),]
> head(use)
          x                          album                         artist
  2435 2698                      Up & Down                  Horace Parlan
  1000 2697                    In New York     Cannonball Adderley Sextet
  1568 2697                    Night Dance                  Jimmy Giuffre
  1401 2693             Mambo Moves Garner                   Errol Garner
  2035 2693                   Swing Street Hoard Alden Dan Barrett Quinte
  1253 2692 Kodachrome: Raymond Scott Comp            Metropole Orchestra

In practice, the most useful solutions would probably fall somewhere between the extremes of doing everything in the database and doing everything in R.

2  Using SQL in R

If you like using SQL, or you see a handy way to get something done using SQL, you can use the sqldf package to operate on R data frames using SQL. When you pass a query to the sqldf function, it will identify the data frames involved in the query, create a temporary SQLite database, perform the query, and return the results. For example, consider the task of merging two data frames, based on the value of a variable id:
> data1 = data.frame(id=c(1,4,2,3,5),x=c(7,12,19,15,9))
> data2 = data.frame(id=c(2,3,4,1,5),y=c(21,32,29,35,19))
> sqldf('select * from data1 natural join data2')
  id  x  y
1  1  7 35
2  4 12 29
3  2 19 21
4  3 15 32
5  5  9 19
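As a larger example, the album-length query from Section 1 could be run directly against the album, track, and artist data frames (a sketch, assuming those data frames have been read from the database with select * queries as shown earlier):

```r
library(sqldf)
tot = sqldf("SELECT al.title, ar.name, SUM(tr.length) AS tot
               FROM album AS al, artist AS ar, track AS tr
              WHERE tr.alid = al.alid AND tr.aid = ar.aid
              GROUP BY tr.alid
             HAVING tot BETWEEN 2400 AND 2700
              ORDER BY tot DESC")
```

Here sqldf does the joins, grouping, and filtering in its temporary SQLite database, so none of the merge or aggregate steps from Section 1 are needed.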

3  Reading Spreadsheets with the RODBC Library

Now that we've got the basics of SQL down, we can look at a third option for reading spreadsheets into R, the RODBC library. Note that this functionality is only available on Windows, but you don't need a copy of Excel on your computer in order for it to work. The ODBC interface provides access to spreadsheets using SQL as a query language. Here's a sample session showing how we could read the spreadsheet from the previous example using RODBC:
> library(RODBC)
> con = odbcConnectExcel('2_2000_top200_postdoc.xls')
> tbls = sqlTables(con)
> tbls
                                         TABLE_CAT TABLE_SCHEM
1 C:\\cygwin\\home\\spector\\2_2000_top200_postdoc        <NA>
2 C:\\cygwin\\home\\spector\\2_2000_top200_postdoc        <NA>
                         TABLE_NAME TABLE_TYPE REMARKS
1             'Postdocs -Top 200 $'      TABLE    <NA>
2 'Postdocs -Top 200 $'Print_Titles      TABLE    <NA>
> qry = paste('select * from ','[',tbls$TABLE_NAME[1],']',sep='')
> postdocs = sqlQuery(con,qry,as.is=TRUE)
> head(postdocs)
  F1 The Top 200 Institutions--Postdoctoral Appointees_(2000)   F3 F4 F5 F6
1 NA                                                     <NA>   NA NA NA NA
2 NA Top 50 Institutions\nin Postdoctoral Appointees \n(2000)   NA NA NA NA
3 NA                                       Harvard University 3491  1  1 NA
4 NA                                      Stanford University 1196  2  2 NA
5 NA                                       Yeshiva University 1122  3  3 NA
6 NA                                 Johns Hopkins University 1029  4  4 NA
                      F7 F8 F9 F10 F11 F12 F13
1                   <NA> NA NA  NA  NA  NA  NA
2 Institutional\nControl NA NA  NA  NA  NA  NA
3                Private NA NA  NA  NA  NA  NA
4                Private NA NA  NA  NA  NA  NA
5                Private NA NA  NA  NA  NA  NA
6                Private NA NA  NA  NA  NA  NA

> postdocs = postdocs[3:nrow(postdocs),c(2,3,4,5,7)]
> postdocs = postdocs[postdocs$F7 %in% c('Public','Private'),]
> dim(postdocs)
[1] 202   5
> sapply(postdocs,class)
The Top 200 Institutions--Postdoctoral Appointees_(2000) 
                                             "character"
                                                      F3 
                                               "numeric" 
                                                      F4 
                                               "numeric" 
                                                      F5 
                                               "numeric" 
                                                      F7 
                                              "character" 
> names(postdocs) = c('Institution','NPostdocs','Rank','ControlRank','Control')

The RODBC interface was able to correctly classify the numeric columns.


