Databases
1 Regular Expressions in SQL
For things like partial matching of strings, all flavors of SQL provide the
LIKE operator, which allows limited wildcard matching. With the
LIKE operator the percent sign (%) stands for 0 or more characters,
and the underscore (_) stands for exactly one. Many modern databases
(MySQL included) provide the RLIKE operator, which uses regular expressions.
For the remainder of these examples, we'll use the MySQL database running on
springer.berkeley.edu. We'll write a regular expression to find all
the artists who have exactly three words in their names.
springer.spector$ mysql -u stat133 -p albums
Enter password:
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 11
Server version: 5.0.51a-3ubuntu5.4-log (Ubuntu)
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql> SELECT name FROM Artist
-> WHERE name RLIKE '^[A-Za-z]+ [A-Za-z]+ [A-Za-z]+$';
+----------------------------------+
| name |
+----------------------------------+
| Mary Lou Williams |
| Dexter Gordon Quartet |
| Modern Jazz Septet |
| Eddie South Trio |
| Gerry Mulligan Tentette |
| Bill Harris rarities |
| Jimmy Heath Sextet |
| Horae Parlan Trio |
Note that in order to access the MySQL database, a username and
password needed to be used.
If you look at the original definition of the SELECT statement, you'll
notice the HAVING clause which we have not yet covered. The basic
difference between HAVING and WHERE is that, due to the timing
of operations in the database, we can't use computed variables (like SUM
and AVG) in the WHERE clause, but you can in the HAVING
clause. Suppose we wanted to find the artists for which there were more than
five albums in the database. A first try might be as follows:
mysql > SELECT aid,COUNT(aid) as count FROM Album
-> WHERE count > 5
-> GROUP BY aid;
ERROR 1054 (42S22): Unknown column 'count' in 'where clause'
Since the WHERE clause doesn't have access to columns you've created through
computations, MySQL can't find the computed variable (count). To fix
this, add a HAVING clause after the GROUP BY clause:
mysql> SELECT aid,COUNT(aid) as count FROM Album
-> GROUP BY aid
-> HAVING count > 5;
+------+-------+
| aid | count |
+------+-------+
| 0 | 6 |
| 14 | 6 |
| 20 | 9 |
| 24 | 6 |
| 40 | 15 |
| 74 | 6 |
| 92 | 6 |
| 101 | 6 |
| 107 | 9 |
. . .
Of course, queries like this are infinitely more useful if the bring in the relevant
information from other tables:
mysql> SELECT ar.name as Artist,COUNT(al.alid) AS count
-> FROM Artist as ar, Album as al
-> WHERE ar.aid = al.aid
-> GROUP BY al.aid HAVING count > 5;
| Abdullah Ibrahim | 6 |
| Ahmad Jamal | 9 |
| Al Cohn | 6 |
| Al Haig | 15 |
| Andre Previn | 6 |
| Anita O'Day | 6 |
| Archie Shepp | 6 |
. . .
2 Accessing databases in R
To provide consistent access to different databases, R uses an intermediate
layer of communication known as the DBI (Data Base Interface) library.
As we've seen before, different commands are necessary to get information
about different databases, so the DBI layer provides a set of routines that
will work identically for all databases. Some of these routines are
described in the following table.
dbConnect | connect to a database |
dbDisconnect | close the connection to a database |
dbExistsTable | returns TRUE or FALSE |
dbGetQuery | send a query to the database and get the results |
dbListTables | shows all the tables in a database |
dbListFields | shows the names of columns in a database |
dbSendQuery | send a query to the database and use fetch to get results |
dbWriteTable | stores a data frame in a database |
To let the DBI layer know what type of database you are using, you
pass a string (like 'MySQL' or 'SQLite') to the
dbDriver function. This will return an object which can be
passed to dbConnect to make a connection to the database.
dbConnect returns a connection object which is passed to
all the other routines so that they can communicate with the database
you are using.
For example, to connect to the albums SQLite database, we'd
use the following statements:
library(RSQLite)
drv = dbDriver('SQLite')
con = dbConnect(drv,db='albums.db')
Since SQLite doesn't require a database server, you can easily create
SQLite databases using dbWriteTable. For very large
data sets, you can process the data in chunks, using the append=
option of dbWriteTable.
To use MySQL, you must have a valid username and password. For the
MySQL database running on springer.berkeley.edu, you'd need
to use the following statements:
drv = dbDriver('MySQL')
con = dbConnect(drv,dbname='albums',user='stat133',pass='T0pSecr3t')
If you're using an SCF machine, you'd need to specify a hostname
(springer)
when
connecting; for non-SCF machines, you would set up an SSH tunnel and
use a hostname of '127.0.0.1'.
Once you get the connection object from dbConnect you can use it
repeatedly to make as many queries as you need, and the functions you
use are the same, regardless of the nature of the underlying database.
It's up to you
to decide how much of the work you want to do through the database, and how
much you'll do in R. For example, suppose we want to find albums in the albums
database whose total time (sum of the length variable in the
Track table) is between 2400 and 2700 seconds, displayed in descending
order of total time.
One approach (after loading the library and making the connection to the database)
would be to do all the work in the database. Notice that this may be your only
option if the database you're dealing with is too large to fit into R.
> albums = dbGetQuery(con,statement='SELECT al.title,ar.name,SUM(tr.length) AS tot
+ FROM Album AS al,Artist AS ar,Track AS tr
+ WHERE tr.alid = al.alid AND tr.aid = ar.aid AND tr.aid = al.aid
+ GROUP BY tr.alid
+ HAVING tot BETWEEN 2400 AND 2700 ORDER BY tot DESC')
> head(albums)
title name tot
1 Blues Groove Tiny Grimes & Coleman Hawkins 2699
2 I Need Some Money Eddie Harris 2699
3 Young Chet Chet Baker 2699
4 Up & Down Horace Parlan 2698
5 Arcadia Shuffle Roy Eldridge 2697
6 Night Dance Jimmy Giuffre 2697
It's usually most convenient to spread out
the query over multiple lines, but the entire query could
be passed as one long string. The dbGetQuery function
gets the results of the query and puts them into a data frame, in this case
called albums.
Another way to find the albums with total time between 2400 and 2700 would be to
use SQL to get the total time of all the albums, and then use the subset
function in R to get the ones in the desired time range:
> albums = dbGetQuery(con,statement='SELECT al.title,ar.name,SUM(tr.length) AS "Total Time"
+ FROM Album AS al,Artist AS ar,Track AS tr
+ WHERE tr.alid = al.alid AND tr.aid = ar.aid AND tr.aid = al.aid
+ GROUP BY tr.alid')
> myalbums = subset(albums,`Total Time` > 2400 & `Total Time` < 2700)
> myalbums = myalbums[order(myalbums$"Total Time",decreasing=TRUE),]
> head(myalbums)
title name Total Time
821 Blues Groove Tiny Grimes & Coleman Hawkins 2699
2206 I Need Some Money Eddie Harris 2699
5632 Young Chet Chet Baker 2699
5337 Up & Down Horace Parlan 2698
418 Arcadia Shuffle Roy Eldridge 2697
3390 Night Dance Jimmy Giuffre 2697
Here I used a more descriptive name for the total time, namely Total Time.
Note that, since there's a space in the variable name, I need to surround it with
backquotes (`) in the subset function, but when I refer to the variable in
the order function, I used ordinary quotes. Backquotes will work in either case.
At the other end of the extreme, you could read in the entire database and work with it
entirely in R. This may not be the wisest approach, because if the database is
too large, you may have problems. Working with the data in this way, however
will allow us to see the R equivalent of some of the database operations we've
been studying. Once again, I'm assuming that the RMySQL library has
been loaded, and a connection object has been obtained through a call to
dbConnect.
> album = dbGetQuery(con,statement="select * from Album")
> track = dbGetQuery(con,statement="select * from Track")
> artist = dbGetQuery(con,statement="select * from Artist")
We now have three data frames in R corresponding to the three data tables in
the database. In R, the equivalent of an inner join can be carried out with
the merge function. You provide merge with the data frames
you wish to join, and a character string or vector of character strings
with the name(s) of the variables that you wish to join them by.
In database terminology, these names are the keys for the join. The merge
function will only accept two data frames, but it can be called repeatedly in
case there are more than two data frames that need to be merged.
If there are variables with the same name (other than the keys) in the two data
frames to be merged, merge will rename them by adding either
.x or .y to the end of the name to tell you which data
frame that variable came from. An often easier approach is to rename variables
before merging so that all the names (other than the keys) will be unique. In
the current example, the variable title is in both the album
data frame and the track data frame. Let's rename the title
variable in the album data frame to albumname:
> names(album)
[1] "alid" "aid" "title" "year"
> names(album)[3] = 'albumname'
Now we can merge the album data frame with the artist data frame
using aid as a key, and then merge the track data frame
with the album data frame using alid as a key.
> album = merge(album,artist,by='aid')
> track = merge(track,album,by='alid')
> head(track)
alid aid.x title filesize bitrate length aid.y albumname
1 1 2836 Sonnymoon For Two 6459 189 279 2836 100% Proof
2 1 2836 Nutty 6062 191 259 2836 100% Proof
3 1 2836 100% Proof 19877 190 853 2836 100% Proof
4 1 2836 Bluesology 7101 189 306 2836 100% Proof
5 1 2836 Night In Tunisia 9102 190 391 2836 100% Proof
6 1 2836 Milestones 10774 190 463 2836 100% Proof
name
1 Tubby Hayes Orchestra
2 Tubby Hayes Orchestra
3 Tubby Hayes Orchestra
4 Tubby Hayes Orchestra
5 Tubby Hayes Orchestra
6 Tubby Hayes Orchestra
To calculate the total length of tracks, we only need the album and artist names
along with the length of each track:
> track = track[,c('albumname','name','length')]
The rest of the processing is pretty straightforward now that everything is in R:
> tot = aggregate(track$length,list(albumname=track$albumname,artist=track$name),sum)
> use = tot[tot$x > 2400 & tot$x <= 2700,]
> use = use[order(use$x,decreasing=TRUE),]
> head(use)
albumname artist x
1105 Young Chet Chet Baker 2699
1748 I Need Some Money Eddie Harris 2699
5245 Blues Groove Tiny Grimes & Coleman Hawkins 2699
2520 Up & Down Horace Parlan 2698
2848 Night Dance Jimmy Giuffre 2697
4559 Arcadia Shuffle Roy Eldridge 2697
Because of the way merge operates, ties are
resolved differently than in previous examples, but the results
are essentially the same.
A different approach illustrates the relationship between factors and the
principles of normalization. Since a factor stores each of its levels only
once, normalized tables can often be used to form factors in R. Once again,
let's assume that we've loaded the RMySQL library and obtained a
connection object. This time, I'll trim off the unneeded variables from
the track data frame before processing:
> album = dbGetQuery(con,statement="select * from Album")
> track = dbGetQuery(con,statement="select * from Track")
> artist = dbGetQuery(con,statement="select * from Artist")
> track = track[,c('aid','alid','length')]
> names(track) = c('artist','album','length')
> track$artist = factor(track$artist,levels=artist$aid,labels=artist$name)
> track$album = factor(track$album,levels=album$alid,labels=album$title)
The tables formed in the process of normalizing the database essentially
contain the levels and labels of a factor variable in R.
Processing the data set is the same as in the previous example:
> track$albart = paste(track$album,track$artist,sep='\t')
> tot = aggregate(track$length,track['albart'],sum)
> aa = strsplit(tot$albart,'\t')
> tot$album = sapply(aa,'[',1)
> tot$artist = sapply(aa,'[',2)
> tot$albart = NULL
> use = tot[tot$x > 2400 & tot$x <= 2700,]
> use = use[order(use$x,decreasing=TRUE),]
> head(use)
x album artist
821 2699 Blues Groove Tiny Grimes & Coleman Hawkins
2206 2699 I Need Some Money Eddie Harris
5632 2699 Young Chet Chet Baker
5337 2698 Up & Down Horace Parlan
418 2697 Arcadia Shuffle Roy Eldridge
3390 2697 Night Dance Jimmy Giuffre
In practice, the most useful solutions would probably fall in between the
extreme of doing everything on the database and doing everything in R.
3 Using SQL in R
If you like using SQL, or you see a handy way to get something done
using SQL, you can use the sqldf package to operate on R
data frames using SQL.
When you pass a query to the sqldf function, it
will identify the data frames involved in the query, create a temporary
SQLite database, perform the query, and return the results.
For example, consider the task of merging two data frames, based
on the value of a variable id:
> data1 = data.frame(id=c(1,4,2,3,5),x=c(7,12,19,15,9))
> data2 = data.frame(id=c(2,3,4,1,5),y=c(21,32,29,35,19))
> sqldf('select * from data1 natural join data2')
id x y
1 1 7 35
2 4 12 29
3 2 19 21
4 3 15 32
5 5 9 19
4 Reading Spreadsheets with the RODBC Library
Now that we've got the basics of SQL down, we can look at a
third option for reading spreadsheets into R, the RODBC
library. Note that this functionality is only available on Windows
and you don't need a copy of Excel on your computer in order for it
to work. The ODBC interface provides access to spreadsheets using
SQL as a query language. Here's a sample session showing how we
could read the spreadsheet from the previous example using
RODBC:
> library(RODBC)
> con = odbcConnectExcel('2_2005_top200_postdoc.xls')
> tbls = sqlTables(con)
> tbls
TABLE_CAT
1 C:\\Documents and Settings\\spector\\Desktop\\2_2005_top200_postdoc
2 C:\\Documents and Settings\\spector\\Desktop\\2_2005_top200_postdoc
3 C:\\Documents and Settings\\spector\\Desktop\\2_2005_top200_postdoc
4 C:\\Documents and Settings\\spector\\Desktop\\2_2005_top200_postdoc
5 C:\\Documents and Settings\\spector\\Desktop\\2_2005_top200_postdoc
TABLE_SCHEM TABLE_NAME TABLE_TYPE REMARKS
1 <NA> Sheet1$ SYSTEM TABLE <NA>
2 <NA> Sheet2$ SYSTEM TABLE <NA>
3 <NA> Sheet3$ SYSTEM TABLE <NA>
4 <NA> Sheet1$Print_Area TABLE <NA>
5 <NA> Sheet1$Print_Titles TABLE <NA>
> qry = paste('select * from ','[',tbls$TABLE_NAME[1],']',sep='')
> postdocs = sqlQuery(con,qry,stringsAsFactors=FALSE)
> head(postdocs)
F1 The Top 200 Institutions--Postdoctoral Appointees_(2005) F3 F4 F5 F6
1 NA <NA> NA NA NA NA
2 NA Top 50 Institutions\nin Postdoctoral Appointees \n(2005) NA NA NA NA
3 NA Harvard University 4384 1 1 NA
4 NA Johns Hopkins University 1442 2 2 NA
5 NA Stanford University 1259 3 3 NA
6 NA University of California - Los Angeles 1094 4 1 NA
F7
1 <NA>
2 Institutional\nControl
3 Private
4 Private
5 Private
6 Public
> sapply(postdocs,class)
F1
"logical"
The Top 200 Institutions--Postdoctoral Appointees_(2005)
"character"
F3
"numeric"
F4
"numeric"
F5
"numeric"
F6
"logical"
F7
"character"
> dim(postdocs)
[1] 208 7
The RODBC interface was able to correctly classify the numeric columns.
File translated from
TEX
by
TTH,
version 3.67.
On 11 Mar 2011, 15:50.