Databases

1 Regular Expressions in SQL

For things like partial matching of strings, all flavors of SQL provide the LIKE operator, which allows limited wildcard matching. With the LIKE operator the percent sign (%) stands for 0 or more characters, and the underscore (_) stands for exactly one. Many modern databases (MySQL included) provide the RLIKE operator, which uses regular expressions. For the remainder of these examples, we'll use the MySQL database running on springer.berkeley.edu. We'll write a regular expression to find all the artists who have exactly three words in their names.

springer.spector$ mysql -u stat133 -p albums
Enter password: 
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 11
Server version: 5.0.51a-3ubuntu5.4-log (Ubuntu)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> SELECT name FROM Artist
    ->    WHERE name RLIKE '^[A-Za-z]+ [A-Za-z]+ [A-Za-z]+$';
+----------------------------------+
| name                             |
+----------------------------------+
| Mary Lou Williams                | 
| Dexter Gordon Quartet            | 
| Modern Jazz Septet               | 
| Eddie South Trio                 | 
| Gerry Mulligan Tentette          | 
| Bill Harris rarities             | 
| Jimmy Heath Sextet               | 
| Horae Parlan Trio                |

Note that in order to access the MySQL database, a username and password needed to be used.

If you look at the original definition of the SELECT statement, you'll notice the HAVING clause which we have not yet covered. The basic difference between HAVING and WHERE is that, due to the timing of operations in the database, we can't use computed variables (like SUM and AVG) in the WHERE clause, but you can in the HAVING clause. Suppose we wanted to find the artists for which there were more than five albums in the database. A first try might be as follows:

mysql > SELECT aid,COUNT(aid) as count FROM Album 
     -> WHERE count > 5
     -> GROUP BY aid; 
ERROR 1054 (42S22): Unknown column 'count' in 'where clause'

Since the WHERE clause doesn't have access to columns you've created through computations, MySQL can't find the computed variable (count). To fix this, add a HAVING clause after the GROUP BY clause:

mysql> SELECT aid,COUNT(aid) as count FROM Album
    ->    GROUP BY aid
    ->    HAVING count > 5;
+------+-------+
| aid  | count |
+------+-------+
|    0 |     6 | 
|   14 |     6 | 
|   20 |     9 | 
|   24 |     6 | 
|   40 |    15 | 
|   74 |     6 | 
|   92 |     6 | 
|  101 |     6 | 
|  107 |     9 | 

       . . .

Of course, queries like this are infinitely more useful if the bring in the relevant information from other tables:

mysql> SELECT ar.name as Artist,COUNT(al.alid) AS count
    ->      FROM Artist as ar, Album as al
    ->      WHERE ar.aid = al.aid
    ->      GROUP BY al.aid HAVING count > 5;
| Abdullah Ibrahim                 |     6 | 
| Ahmad Jamal                      |     9 | 
| Al Cohn                          |     6 | 
| Al Haig                          |    15 | 
| Andre Previn                     |     6 | 
| Anita O'Day                      |     6 | 
| Archie Shepp                     |     6 | 
       . . .

2 Accessing databases in R

To provide consistent access to different databases, R uses an intermediate layer of communication known as the DBI (Data Base Interface) library. As we've seen before, different commands are necessary to get information about different databases, so the DBI layer provides a set of routines that will work identically for all databases. Some of these routines are described in the following table.

`dbConnect`	connect to a database
`dbDisconnect`	close the connection to a database
`dbExistsTable`	returns TRUE or FALSE
`dbGetQuery`	send a query to the database and get the results
`dbListTables`	shows all the tables in a database
`dbListFields`	shows the names of columns in a database
`dbSendQuery`	send a query to the database and use `fetch` to get results
`dbWriteTable`	stores a data frame in a database

To let the DBI layer know what type of database you are using, you pass a string (like 'MySQL' or 'SQLite') to the dbDriver function. This will return an object which can be passed to dbConnect to make a connection to the database. dbConnect returns a connection object which is passed to all the other routines so that they can communicate with the database you are using.

For example, to connect to the albums SQLite database, we'd use the following statements:

library(RSQLite)
drv = dbDriver('SQLite')
con = dbConnect(drv,db='albums.db')

Since SQLite doesn't require a database server, you can easily create SQLite databases using dbWriteTable. For very large data sets, you can process the data in chunks, using the append= option of dbWriteTable.

To use MySQL, you must have a valid username and password. For the MySQL database running on springer.berkeley.edu, you'd need to use the following statements:

drv = dbDriver('MySQL')
con = dbConnect(drv,dbname='albums',user='stat133',pass='T0pSecr3t')

If you're using an SCF machine, you'd need to specify a hostname (springer) when connecting; for non-SCF machines, you would set up an SSH tunnel and use a hostname of '127.0.0.1'.

Once you get the connection object from dbConnect you can use it repeatedly to make as many queries as you need, and the functions you use are the same, regardless of the nature of the underlying database.

It's up to you to decide how much of the work you want to do through the database, and how much you'll do in R. For example, suppose we want to find albums in the albums database whose total time (sum of the length variable in the Track table) is between 2400 and 2700 seconds, displayed in descending order of total time.

One approach (after loading the library and making the connection to the database) would be to do all the work in the database. Notice that this may be your only option if the database you're dealing with is too large to fit into R.

> albums = dbGetQuery(con,statement='SELECT al.title,ar.name,SUM(tr.length) AS tot
+                                 FROM Album AS al,Artist AS ar,Track AS tr
+                                 WHERE tr.alid = al.alid AND tr.aid = ar.aid AND tr.aid = al.aid
+                                 GROUP BY tr.alid
+                                 HAVING tot BETWEEN 2400 AND 2700 ORDER BY tot DESC')
> head(albums)
              title                           name  tot
1      Blues Groove Tiny Grimes  & Coleman Hawkins 2699
2 I Need Some Money                   Eddie Harris 2699
3        Young Chet                     Chet Baker 2699
4         Up & Down                  Horace Parlan 2698
5   Arcadia Shuffle                   Roy Eldridge 2697
6       Night Dance                  Jimmy Giuffre 2697

It's usually most convenient to spread out the query over multiple lines, but the entire query could be passed as one long string. The dbGetQuery function gets the results of the query and puts them into a data frame, in this case called albums.

Another way to find the albums with total time between 2400 and 2700 would be to use SQL to get the total time of all the albums, and then use the subset function in R to get the ones in the desired time range:

> albums = dbGetQuery(con,statement='SELECT al.title,ar.name,SUM(tr.length) AS "Total Time"
+                                 FROM Album AS al,Artist AS ar,Track AS tr
+                                 WHERE tr.alid = al.alid AND tr.aid = ar.aid AND tr.aid = al.aid
+                                 GROUP BY tr.alid')
> myalbums = subset(albums,`Total Time` > 2400 & `Total Time` < 2700)
> myalbums = myalbums[order(myalbums$"Total Time",decreasing=TRUE),]
> head(myalbums)
                 title                           name Total Time
821       Blues Groove Tiny Grimes  & Coleman Hawkins       2699
2206 I Need Some Money                   Eddie Harris       2699
5632        Young Chet                     Chet Baker       2699
5337         Up & Down                  Horace Parlan       2698
418    Arcadia Shuffle                   Roy Eldridge       2697
3390       Night Dance                  Jimmy Giuffre       2697

Here I used a more descriptive name for the total time, namely Total Time. Note that, since there's a space in the variable name, I need to surround it with backquotes (`) in the subset function, but when I refer to the variable in the order function, I used ordinary quotes. Backquotes will work in either case.

At the other end of the extreme, you could read in the entire database and work with it entirely in R. This may not be the wisest approach, because if the database is too large, you may have problems. Working with the data in this way, however will allow us to see the R equivalent of some of the database operations we've been studying. Once again, I'm assuming that the RMySQL library has been loaded, and a connection object has been obtained through a call to dbConnect.

> album = dbGetQuery(con,statement="select * from Album")
> track = dbGetQuery(con,statement="select * from Track")
> artist = dbGetQuery(con,statement="select * from Artist")

We now have three data frames in R corresponding to the three data tables in the database. In R, the equivalent of an inner join can be carried out with the merge function. You provide merge with the data frames you wish to join, and a character string or vector of character strings with the name(s) of the variables that you wish to join them by. In database terminology, these names are the keys for the join. The merge function will only accept two data frames, but it can be called repeatedly in case there are more than two data frames that need to be merged.

If there are variables with the same name (other than the keys) in the two data frames to be merged, merge will rename them by adding either .x or .y to the end of the name to tell you which data frame that variable came from. An often easier approach is to rename variables before merging so that all the names (other than the keys) will be unique. In the current example, the variable title is in both the album data frame and the track data frame. Let's rename the title variable in the album data frame to albumname:

> names(album)
[1] "alid"  "aid"   "title" "year"
> names(album)[3] = 'albumname'

Now we can merge the album data frame with the artist data frame using aid as a key, and then merge the track data frame with the album data frame using alid as a key.

> album = merge(album,artist,by='aid')
> track = merge(track,album,by='alid')
> head(track)
  alid aid.x             title filesize bitrate length aid.y  albumname
1    1  2836 Sonnymoon For Two     6459     189    279  2836 100% Proof
2    1  2836             Nutty     6062     191    259  2836 100% Proof
3    1  2836        100% Proof    19877     190    853  2836 100% Proof
4    1  2836        Bluesology     7101     189    306  2836 100% Proof
5    1  2836  Night In Tunisia     9102     190    391  2836 100% Proof
6    1  2836        Milestones    10774     190    463  2836 100% Proof
                   name
1 Tubby Hayes Orchestra
2 Tubby Hayes Orchestra
3 Tubby Hayes Orchestra
4 Tubby Hayes Orchestra
5 Tubby Hayes Orchestra
6 Tubby Hayes Orchestra

To calculate the total length of tracks, we only need the album and artist names along with the length of each track:

> track = track[,c('albumname','name','length')]

The rest of the processing is pretty straightforward now that everything is in R:

> tot = aggregate(track$length,list(albumname=track$albumname,artist=track$name),sum)
> use = tot[tot$x > 2400 & tot$x <= 2700,]
> use = use[order(use$x,decreasing=TRUE),]
> head(use)
             albumname                         artist    x
1105        Young Chet                     Chet Baker 2699
1748 I Need Some Money                   Eddie Harris 2699
5245      Blues Groove Tiny Grimes  & Coleman Hawkins 2699
2520         Up & Down                  Horace Parlan 2698
2848       Night Dance                  Jimmy Giuffre 2697
4559   Arcadia Shuffle                   Roy Eldridge 2697

Because of the way merge operates, ties are resolved differently than in previous examples, but the results are essentially the same.

A different approach illustrates the relationship between factors and the principles of normalization. Since a factor stores each of its levels only once, normalized tables can often be used to form factors in R. Once again, let's assume that we've loaded the RMySQL library and obtained a connection object. This time, I'll trim off the unneeded variables from the track data frame before processing:

> album = dbGetQuery(con,statement="select * from Album")
> track = dbGetQuery(con,statement="select * from Track")
> artist = dbGetQuery(con,statement="select * from Artist")
> track = track[,c('aid','alid','length')]
> names(track) = c('artist','album','length')
> track$artist = factor(track$artist,levels=artist$aid,labels=artist$name)
> track$album =  factor(track$album,levels=album$alid,labels=album$title)

The tables formed in the process of normalizing the database essentially contain the levels and labels of a factor variable in R. Processing the data set is the same as in the previous example:

> track$albart = paste(track$album,track$artist,sep='\t')
> tot = aggregate(track$length,track['albart'],sum)
> aa = strsplit(tot$albart,'\t')
> tot$album = sapply(aa,'[',1)
> tot$artist = sapply(aa,'[',2)
> tot$albart = NULL
> use = tot[tot$x > 2400 & tot$x <= 2700,]
> use = use[order(use$x,decreasing=TRUE),]
> head(use)
        x                            album                         artist
821  2699      Blues Groove Tiny Grimes  & Coleman Hawkins
2206 2699 I Need Some Money                   Eddie Harris
5632 2699        Young Chet                     Chet Baker
5337 2698         Up & Down                  Horace Parlan
418  2697   Arcadia Shuffle                   Roy Eldridge
3390 2697       Night Dance                  Jimmy Giuffre

In practice, the most useful solutions would probably fall in between the extreme of doing everything on the database and doing everything in R.

3 Using SQL in R

If you like using SQL, or you see a handy way to get something done using SQL, you can use the sqldf package to operate on R data frames using SQL. When you pass a query to the sqldf function, it will identify the data frames involved in the query, create a temporary SQLite database, perform the query, and return the results. For example, consider the task of merging two data frames, based on the value of a variable id:

> data1 = data.frame(id=c(1,4,2,3,5),x=c(7,12,19,15,9))
> data2 = data.frame(id=c(2,3,4,1,5),y=c(21,32,29,35,19))
> sqldf('select * from data1 natural join data2')
  id  x  y
1  1  7 35
2  4 12 29
3  2 19 21
4  3 15 32
5  5  9 19

4 Reading Spreadsheets with the `RODBC` Library

Now that we've got the basics of SQL down, we can look at a third option for reading spreadsheets into R, the RODBC library. Note that this functionality is only available on Windows and you don't need a copy of Excel on your computer in order for it to work. The ODBC interface provides access to spreadsheets using SQL as a query language. Here's a sample session showing how we could read the spreadsheet from the previous example using RODBC:

> library(RODBC)
> con = odbcConnectExcel('2_2005_top200_postdoc.xls')
> tbls = sqlTables(con)
> tbls
                                                            TABLE_CAT
1 C:\\Documents and Settings\\spector\\Desktop\\2_2005_top200_postdoc
2 C:\\Documents and Settings\\spector\\Desktop\\2_2005_top200_postdoc
3 C:\\Documents and Settings\\spector\\Desktop\\2_2005_top200_postdoc
4 C:\\Documents and Settings\\spector\\Desktop\\2_2005_top200_postdoc
5 C:\\Documents and Settings\\spector\\Desktop\\2_2005_top200_postdoc
  TABLE_SCHEM          TABLE_NAME   TABLE_TYPE REMARKS
1        <NA>             Sheet1$ SYSTEM TABLE    <NA>
2        <NA>             Sheet2$ SYSTEM TABLE    <NA>
3        <NA>             Sheet3$ SYSTEM TABLE    <NA>
4        <NA>   Sheet1$Print_Area        TABLE    <NA>
5        <NA> Sheet1$Print_Titles        TABLE    <NA>
> qry = paste('select * from ','[',tbls$TABLE_NAME[1],']',sep='')
> postdocs = sqlQuery(con,qry,stringsAsFactors=FALSE)
> head(postdocs)
  F1 The Top 200 Institutions--Postdoctoral Appointees_(2005)   F3 F4 F5 F6
1 NA                                                     <NA>   NA NA NA NA
2 NA Top 50 Institutions\nin Postdoctoral Appointees \n(2005)   NA NA NA NA
3 NA                                       Harvard University 4384  1  1 NA
4 NA                                 Johns Hopkins University 1442  2  2 NA
5 NA                                      Stanford University 1259  3  3 NA
6 NA                   University of California - Los Angeles 1094  4  1 NA
                      F7
1                   <NA>
2 Institutional\nControl
3                Private
4                Private
5                Private
6                 Public
> sapply(postdocs,class)
                                                      F1 
                                               "logical" 
The Top 200 Institutions--Postdoctoral Appointees_(2005) 
                                             "character" 
                                                      F3 
                                               "numeric" 
                                                      F4 
                                               "numeric" 
                                                      F5 
                                               "numeric" 
                                                      F6 
                                               "logical" 
                                                      F7 
                                             "character" 
> dim(postdocs)
[1] 208   7

The RODBC interface was able to correctly classify the numeric columns.

File translated from T_EX by T_TH, version 3.67.
On 11 Mar 2011, 15:50.

Databases

1 Regular Expressions in SQL

2 Accessing databases in R

3 Using SQL in R

4 Reading Spreadsheets with the RODBC Library

4 Reading Spreadsheets with the `RODBC` Library