Using MySQL with R
1 Accessing databases in R
To provide consistent access to different databases, R uses an intermediate
layer of communication known as the DBI (Database Interface) library.
As we've seen before, different commands are necessary to get information
about different databases, so the DBI layer provides a set of routines that
will work identically for all databases. Some of these routines are
described in the following table.
dbConnect | connect to a database
dbDisconnect | close the connection to a database
dbExistsTable | returns TRUE or FALSE depending on whether a table exists in the database
dbGetQuery | send a query to the database and get the results
dbListTables | shows all the tables in a database
dbListFields | shows the names of the columns in a table
dbSendQuery | send a query to the database and use fetch to get the results
dbWriteTable | stores a data frame in a database
To let the DBI layer know what type of database you are using, you
pass a string (like 'MySQL' or 'SQLite') to the
dbDriver function. This will return an object which can be
passed to dbConnect to make a connection to the database.
dbConnect returns a connection object which is passed to
all the other routines so that they can communicate with the database
you are using.
For example, to connect to the albums SQLite database, we'd
use the following statements:
drv = dbDriver('SQLite')
con = dbConnect(drv,dbname='albums.db')
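Once the connection is made, the other routines from the table above can all be
called with the connection object. Here's a brief sketch, assuming the Album and
Track tables from the albums database that we'll be working with below:
dbListTables(con)             # names of all the tables in the database
dbListFields(con,'Album')     # names of the columns in the Album table
dbExistsTable(con,'Track')    # TRUE, since the Track table exists
res = dbSendQuery(con,'SELECT * FROM Track')
first = fetch(res,n=10)       # retrieve just the first ten rows of the result
dbClearResult(res)            # release the result set when you're done with it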
Since SQLite doesn't require a database server, you can easily create
SQLite databases using dbWriteTable. For very large
data sets, you can process the data in chunks, using the append=
option of dbWriteTable.
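For example, a large text file could be loaded into an SQLite database a chunk
at a time; the file name, table name, and chunk size below are hypothetical:
con = dbConnect(dbDriver('SQLite'),dbname='big.db')
infile = file('bigdata.csv','r')
chunk = read.csv(infile,nrows=10000)        # first chunk, including the header
dbWriteTable(con,'bigtable',chunk)
repeat{
   chunk = try(read.csv(infile,nrows=10000,header=FALSE,
                        col.names=names(chunk)),silent=TRUE)
   if(inherits(chunk,'try-error'))break     # no lines left to read
   dbWriteTable(con,'bigtable',chunk,append=TRUE)
}
close(infile)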
To use MySQL, you must have a valid username and password. For the
MySQL database running on springer.berkeley.edu, you'd need
to use the following statements:
drv = dbDriver('MySQL')
con = dbConnect(drv,dbname='albums',user='stat133',pass='T0pSecr3t')
If you're using an SCF machine, you'd need to specify a hostname (springer)
when connecting; for non-SCF machines, you would set up an SSH tunnel and
use a hostname of '127.0.0.1'.
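For example, a connection made from one of the SCF machines might look like the
following; through an SSH tunnel from elsewhere, the host would be '127.0.0.1'
instead (this is just a sketch of the idea):
drv = dbDriver('MySQL')
con = dbConnect(drv,dbname='albums',user='stat133',pass='T0pSecr3t',
                host='springer')    # host='127.0.0.1' if using an SSH tunnel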
Once you get the connection object from dbConnect you can use it
repeatedly to make as many queries as you need, and the functions you
use are the same, regardless of the nature of the underlying database.
It's up to you
to decide how much of the work you want to do through the database, and how
much you'll do in R. For example, suppose we want to find albums in the albums
database whose total time (sum of the length variable in the
Track table) is between 2400 and 2700 seconds, displayed in descending
order of total time.
One approach (after loading the library and making the connection to the database)
would be to do all the work in the database. Notice that this may be your only
option if the database you're dealing with is too large to fit into R.
> albums = dbGetQuery(con,statement='SELECT al.title,ar.name,SUM(tr.length) AS tot\
+ FROM Album AS al,Artist AS ar,Track AS tr\
+ WHERE tr.alid = al.alid AND tr.aid = ar.aid AND tr.aid = al.aid\
+ GROUP BY tr.alid\
+ HAVING tot BETWEEN 2400 AND 2700 ORDER BY tot DESC')
> head(albums)
title name tot
1 'Perk Up' Shelly Manne 2684
2 Kaleidoscope Sonny Stitt 2679
3 Red Garland's Piano (Remastere Red Garland 2676
4 Ask The Ages Sonny Sharrock 2675
5 Duo Charlie Hunter & Leon Parker 2667
6 Tenor Conclave Hank Mobley/Al Cohn/John Coltr 2665
Note the use of a single backslash at the end of each line of input to allow typing
the query over multiple lines. Alternatively, the entire query could
be passed as one long string. The dbGetQuery function
gets the results of the query and puts them into a data frame, in this case
called albums.
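For example, the same query could be built up with paste (whose default
separator is a space) and then passed to dbGetQuery in a single call:
qry = paste('SELECT al.title,ar.name,SUM(tr.length) AS tot',
            'FROM Album AS al,Artist AS ar,Track AS tr',
            'WHERE tr.alid = al.alid AND tr.aid = ar.aid AND tr.aid = al.aid',
            'GROUP BY tr.alid',
            'HAVING tot BETWEEN 2400 AND 2700 ORDER BY tot DESC')
albums = dbGetQuery(con,statement=qry)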
At the other extreme, you could read the entire database into R and work with it
there. This may not be the wisest approach, because if the database is
too large to fit comfortably in memory, you may run into problems. Working with
the data this way, however, will allow us to see the R equivalent of some of the
database operations we've been studying. Once again, I'm assuming that the
RMySQL library has been loaded, and a connection object has been obtained
through a call to dbConnect.
> album = dbGetQuery(con,statement="select * from Album")
> track = dbGetQuery(con,statement="select * from Track")
> artist = dbGetQuery(con,statement="select * from Artist")
We now have three data frames in R corresponding to the three data tables in
the database. In R, the equivalent of an inner join can be carried out with
the merge function. You provide merge with the data frames
you wish to join, and a character string or vector of character strings
with the name(s) of the variables that you wish to join them by.
In database terminology, these names are the keys for the join. The merge
function will only accept two data frames, but it can be called repeatedly in
case there are more than two data frames that need to be merged.
If there are variables with the same name (other than the keys) in the two data
frames to be merged, merge will rename them by adding either
.x or .y to the end of the name to tell you which data
frame that variable came from. An often easier approach is to rename variables
before merging so that all the names (other than the keys) will be unique. In
the current example, the variable title is in both the album
data frame and the track data frame. Let's rename the title
variable in the album data frame to albumname:
> names(album)
[1] "alid" "aid" "title" "year"
> names(album)[3] = 'albumname'
Now we can merge the album data frame with the artist data frame
using aid as a key, and then merge the track data frame
with the album data frame using alid as a key.
> album = merge(album,artist,by='aid')
> track = merge(track,album,by='alid')
> head(track)
alid aid.x title.x filesize bitrate length aid.y
1 1 439 Makin' Whoopee 5870 256 188 439
2 1 439 People Will Say We're In Love 6020 256 193 439
3 1 439 Until The Real Thing Comes Alo 5588 256 179 439
4 1 439 Nice Work If You Can Get It 4948 256 158 439
5 1 439 My Heart Belongs To Daddy 4998 256 160 439
6 1 439 What Is There To Say 6340 256 203 439
title.y name
1 Songs In A Mellow Mood Ella Fitzgerald
2 Songs In A Mellow Mood Ella Fitzgerald
3 Songs In A Mellow Mood Ella Fitzgerald
4 Songs In A Mellow Mood Ella Fitzgerald
5 Songs In A Mellow Mood Ella Fitzgerald
6 Songs In A Mellow Mood Ella Fitzgerald
To calculate the total length of tracks, we only need the album and artist names
along with the length of each track:
> track = track[,c('albumname','name','length')]
The rest of the processing is pretty straightforward now that everything is in R:
> tot = aggregate(track$length,list(albumname=track$albumname,artist=track$name),sum)
The previous command would be very slow, because aggregate considers all
possible combinations of albumname and artist, which is very
inefficient for data like this. An alternative is to form a single grouping variable
and then to break off the artist and album names after aggregation:
> track$albart = paste(track$albumname,track$name,sep='\t')
> tot = aggregate(track$length,track['albart'],sum)
> aa = strsplit(tot$albart,'\t')
> tot$album = sapply(aa,'[',1)
> tot$artist= sapply(aa,'[',2)
> tot$albart = NULL
> use = tot[tot$x > 2400 & tot$x < 2700,]
> use = use[order(use$x,decreasing=TRUE),]
> head(use)
x album artist
2397 2698 Up & Down Horace Parlan
999 2697 In New York Cannonball Adderley Sextet
1534 2697 Night Dance Jimmy Giuffre
2502 2694 Yaina Pucho and His Latin Soul Broth
1367 2693 Mambo Moves Garner Errol Garner
1998 2693 Swing Street Hoard Alden Dan Barrett Quinte
A different approach illustrates the relationship between factors and the
principles of normalization. Since a factor stores each of its levels only
once, normalized tables can often be used to form factors in R. Once again,
let's assume that we've loaded the RMySQL library and obtained a
connection object. This time, I'll trim off the unneeded variables from
the track data frame before processing:
> album = dbGetQuery(con,statement="select * from Album")
> track = dbGetQuery(con,statement="select * from Track")
> artist = dbGetQuery(con,statement="select * from Artist")
> track = track[,c('aid','alid','length')]
> names(track) = c('artist','album','length')
> track$artist = factor(track$artist,levels=artist$aid,labels=artist$name)
> track$album = factor(track$album,levels=album$alid,labels=album$title)
The tables formed in the process of normalizing the database essentially
contain the levels and labels of a factor variable in R.
Processing the data set is the same as in the previous example:
> track$albart = paste(track$album,track$artist,sep='\t')
> tot = aggregate(track$length,track['albart'],sum)
> aa = strsplit(tot$albart,'\t')
> tot$album = sapply(aa,'[',1)
> tot$artist = sapply(aa,'[',2)
> tot$albart = NULL
> use = tot[tot$x > 2400 & tot$x < 2700,]
> use = use[order(use$x,decreasing=TRUE),]
> head(use)
x album artist
2435 2698 Up & Down Horace Parlan
1000 2697 In New York Cannonball Adderley Sextet
1568 2697 Night Dance Jimmy Giuffre
1401 2693 Mambo Moves Garner Errol Garner
2035 2693 Swing Street Hoard Alden Dan Barrett Quinte
1253 2692 Kodachrome: Raymond Scott Comp Metropole Orchestra
In practice, the most useful solutions would probably fall somewhere between
the extremes of doing everything in the database and doing everything in R.
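For example, one intermediate solution would be to let the database do the
three-table join, and then carry out the aggregation, subsetting, and ordering
in R; the following sketch uses the same connection object and the same
techniques as the examples above:
qry = paste('SELECT al.title AS albumname,ar.name,tr.length',
            'FROM Album AS al,Artist AS ar,Track AS tr',
            'WHERE tr.alid = al.alid AND tr.aid = ar.aid AND tr.aid = al.aid')
joined = dbGetQuery(con,statement=qry)
joined$albart = paste(joined$albumname,joined$name,sep='\t')
tot = aggregate(joined$length,joined['albart'],sum)
aa = strsplit(tot$albart,'\t')
tot$album = sapply(aa,'[',1)
tot$artist = sapply(aa,'[',2)
tot$albart = NULL
use = tot[tot$x > 2400 & tot$x < 2700,]
use = use[order(use$x,decreasing=TRUE),]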
2 Using SQL in R
If you like using SQL, or you see a handy way to get something done
with it, you can use the sqldf package to run SQL queries
directly on R data frames.
When you pass a query to the sqldf function, it
will identify the data frames involved in the query, create a temporary
SQLite database, perform the query, and return the results.
For example, consider the task of merging two data frames, based
on the value of a variable id:
> data1 = data.frame(id=c(1,4,2,3,5),x=c(7,12,19,15,9))
> data2 = data.frame(id=c(2,3,4,1,5),y=c(21,32,29,35,19))
> sqldf('select * from data1 natural join data2')
id x y
1 1 7 35
2 4 12 29
3 2 19 21
4 3 15 32
5 5 9 19
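The sqldf function can also reproduce the album query from the first section
directly on data frames. For example, assuming the album, artist, and
track data frames as they were originally read from the database (before
any renaming or merging), something like the following should work:
library(sqldf)
qry = paste('SELECT al.title,ar.name,SUM(tr.length) AS tot',
            'FROM album AS al,artist AS ar,track AS tr',
            'WHERE tr.alid = al.alid AND tr.aid = ar.aid AND tr.aid = al.aid',
            'GROUP BY tr.alid',
            'HAVING tot BETWEEN 2400 AND 2700 ORDER BY tot DESC')
tots = sqldf(qry)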
3 Reading Spreadsheets with the RODBC Library
Now that we've got the basics of SQL down, we can look at a
third option for reading spreadsheets into R, the RODBC
library. Note that this functionality is only available on Windows, although
you don't need a copy of Excel on your computer in order for it
to work. The ODBC interface provides access to spreadsheets using
SQL as a query language. Here's a sample session showing how we
could read the spreadsheet from the previous example using
RODBC:
> library(RODBC)
> con = odbcConnectExcel('2_2000_top200_postdoc.xls')
> tbls = sqlTables(con)
> tbls
TABLE_CAT TABLE_SCHEM
1 C:\\cygwin\\home\\spector\\2_2000_top200_postdoc <NA>
2 C:\\cygwin\\home\\spector\\2_2000_top200_postdoc <NA>
TABLE_NAME TABLE_TYPE REMARKS
1 'Postdocs -Top 200 $' TABLE <NA>
2 'Postdocs -Top 200 $'Print_Titles TABLE <NA>
> qry = paste('select * from ','[',tbls$TABLE_NAME[1],']',sep='')
> postdocs = sqlQuery(con,qry,as.is=TRUE)
> head(postdocs)
F1 The Top 200 Institutions--Postdoctoral Appointees_(2000) F3 F4 F5 F6
1 NA <NA> NA NA NA NA
2 NA Top 50 Institutions\nin Postdoctoral Appointees \n(2000) NA NA NA NA
3 NA Harvard University 3491 1 1 NA
4 NA Stanford University 1196 2 2 NA
5 NA Yeshiva University 1122 3 3 NA
6 NA Johns Hopkins University 1029 4 4 NA
F7 F8 F9 F10 F11 F12 F13
1 <NA> NA NA NA NA NA NA
2 Institutional\nControl NA NA NA NA NA NA
3 Private NA NA NA NA NA NA
4 Private NA NA NA NA NA NA
5 Private NA NA NA NA NA NA
6 Private NA NA NA NA NA NA
> postdocs = postdocs[3:nrow(postdocs),c(2,3,4,5,7)]
> postdocs = postdocs[postdocs$F7 %in% c('Public','Private'),]
> dim(postdocs)
[1] 202 5
> sapply(postdocs,class)
The Top 200 Institutions--Postdoctoral Appointees_(2000)
"character"
F3
"numeric"
F4
"numeric"
F5
"numeric"
F7
"character"
> names(postdocs) = c('Institution','NPostdocs','Rank','ControlRank','Control')
The RODBC interface was able to correctly classify the numeric columns.