Databases

1 Databases

A database is a collection of data, usually with some information (sometimes called metadata) about how the data is organized. But many times when people refer to a database, they mean a database server, which is similar to a web server, but responds to requests for data instead of web pages.

By far the most common type of database server is known a a relational database management system (RDBMS), and the most common way of communicating with such a database is a language known as SQL, which is an acronym for Structured Query Language. Some examples of database systems that use SQL to communicate with an RDBMS include Oracle, Sybase, Microsoft SQL Server, SQLite, MySQL and Postgres. While there is an SQL standard, each database manufacturer provides some additional features, so SQL that works with one database is not guaranteed to work on another. We'll try to stick to aspects of SQL that should be available on most SQL based systems.

A Database consists of one or more tables, which are generally stored as files on the computer that's running the DBMS. A table is a rectangular array, where each row represents an observation, and each column represents a variable. Most databases consists of several tables, each containing information about one aspect of the data. For example, consider a database to hold information about the parts needed to build a variety of different products. One way to store the information would be to have a data table that had the part id, its description, the supplier's name, and the cost of the part. An immediate problem with this is scheme concerns how we could store information about the relation between products and parts. Furthermore, if one supplier provided many different parts, redundant information about suppliers would be repeated many times in the data table. In the 1970s when database technology was first being developed, disk and memory space were limited and expensive, and organizing data in this way was not very efficient, especially as the size of the data base increased. In the late 1970s, the idea of a relational database was developed by IBM, with the first commercial offering coming from a company which is now known as Oracle. Relational database design is governed by a principle known as normalization. While entire books are devoted to the subject, the basic idea of normalization is to try and remove redundancy as much as possible when creating the tables that make up a data base. Continuing the parts example, a properly normalized database to hold the parts information would consist of four tables. The first would contain a part id to uniquely identify the part, its description, price and an id to identify the supplier. A second table would contain the supplier codes and any other information about the supplier. A third table would contain product ids and descriptions, while the final table would have one record for each part used in each product, stored as pairs of product id and part id. The variables that link together the different databases are refered to as keys, or sometimes foreign keys. Clearly, making sure that there are keys to link information from one table to another is critical to the idea of a normalized data base. This allows large amounts of information to be stored in manageably sized tables which can be modified and updated without having to change all of the information in all of the tables. Such tables will be efficient in the amount of disk and memory resources that they need, which was critical at the time such databases were developed. Also critical to this scheme is a fast and efficient way of joining together these tables so that queries like "Which suppliers are used for product xyz?" or "What parts from Acme Machine Works cost between $2 and $4" or "What is the total cost of the parts needed to make product xyz?". In fact, for many years the only programs that were capable of combining data from multiple sources were RDBMSs.

We're going to look at the way the SQL language is used to extract information from RDBMSs, specifically the open source SQLite data base (http://sqlite.org) (which doesn't require a database server), and the open source MySQL data base (http://www.mysql.com) which does. In addition we'll see how to do typical database operations in R, and how to access a database using SQL from inside of R.

One of the first questions that arises when thinking about databases is "When should I use a database?" There are several cases where a database makes sense:

The only source for the data you need may be an existing database.
The data you're working with changes on a regular basis, especially if it can potentially be changed by many people. Database servers (like most other servers) provide concurrency, which takes care of problems that might otherwise arise if more than one person tries to access or modify a database at the same time.
Your data is simply too big to fit into the computer's memory, but if you could conveniently extract just the parts you need, you'd be able to work with the data. For example, you may have hundreds of variables, but only need to work with a small number at at time, or you have tens of thousands of observations, but only need to work with a few thousand at a time.

One of the most important uses of databases in the last ten years has been as back ends to dynamic web sites. If you've ever placed an order online, joined an online discussion community, or gotten a username/password combination to access web resources, you've most likely interacted with a database. In addition both the Firefox and Safari browsers us SQLite databases to store cookies, certificates, and other information.

There are actually three parts to a RDBMS system: data definition, data access, and privelege management. We're only going to look at the data access aspect of databases; we'll assume that a database is already available, and that the administrator of the database has provided you with access to the resources you need. You can communicate with a database in at least three different ways:

You can use a command line program that will display its results in a way similar to the way R works: you type in a query, and the results are displayed.
You can use a graphical interface (GUI) that will display the results and give you the option to export them in different forms, such as a comma-separated data file. One advantage to these clients is that they usually provide a display of the available databases, along with the names of the tables in the database and the names of the columns within the tables.
You can use a library within R to query the database and return the results of the query into an R data frame. For use with MySQL library, the RMySQL library is available for download from CRAN; for SQLite, the RSQLite package is available. When you install either, the DBI library, which provides a consistent interface to different databases, will also be installed.

It's very important to understand that SQL is not a programming language - it's said to be a declarative language. Instead of solving problems by writing a series of instructions or putting together programming statements into functions, with SQL you write a single statement (query) that will return some or all of the records from a database. SQL was designed to be easy to use for non-programmers, allowing them to present queries to the database in a language that is more similar to a spoken language than a programming language. The main tool for data access is the select statement. Since everything needs to be done with this single statement, descriptions of its syntax can be quite daunting. For example, here's an example that shows some of the important features of the select statement:

SELECT columns or computations
       FROM table
       WHERE condition
       GROUP BY columns
       HAVING condition
       ORDER BY column [ASC | DESC]
       LIMIT offset,count;

(The keywords in the above statement can be typed in upper or lower case, but I'll generally use upper case to remind us that they are keywords.) MySQL is case sensistive to table and column names, but not for commands. SQLite is not case sensitive at all. SQL statements always end with a semi-colon, but graphical clients and interfaces in other languages sometimes don't require the semi-colon. Fortunately, you rarely need to use all of these commands in a single query. In fact, you may be able to get by knowing just the command that returns all of the information from a table in a database:

       SELECT * from table;

2 Working with Databases

When you're not using a graphical interface, it's easy to get lost in a database considering that each database can hold multiple tables, and each table can have multiple columns. Three commands that are useful in finding your way around a database are:

MySQL	SQLite
SHOW DATABASES;	.databases
SHOW TABLES IN database;	.tables
SHOW COLUMNS IN table;	pragma table_info(table)
DESCRIBE table;	.schema table

For use as an example, consider a database to keep track of a music collection, stored as an SQLite database in the file albums.db. We can use the commands just shown to see how the database is organized. In class, I sometimes used some of the graphical SQLite clients that are mentioned on the Class Resources page. Here, I'm using the UNIX command-line client called sqlite3:

springer.spector$ sqlite3 albums.db 
SQLite version 3.4.2
Enter ".help" for instructions
sqlite> .mode column
sqlite> .header on
sqlite> .tables
Album   Artist  Track 
	title TEXT 
);
sqlite> .schema Album
CREATE TABLE Album 
( aid INTEGER,
	alid INTEGER,
	title TEXT 
);
CREATE INDEX jj on Album(alid);
sqlite> .schema Artist
CREATE TABLE Artist 
( aid INTEGER,
	name TEXT 
);
CREATE INDEX ii on Artist(aid);

Studying these tables we can see that the organization is as follows: each track in the Track table is identified by an artist id (aid) and an album id (alid). To get the actual names of the artists, we can link to the Artist database through the aid key; to get the names of the albums, we can link to the Album table through the alid key. These keys are integers, which don't take up very much space; the actual text of the album title or artist's name is stored only once in the table devoted to that purpose. (This is the basic principle of normalization that drives database design.) The first thing to notice in schemes like this is that the individual tables won't allow us to really see anything iteresting, since they just contain keys, not actual values; to get interesting information, we need to join together multiple tables. For example, let's say we want a list of album titles and artist names. We want to extract the Name from the Artist table and combine it with Title from the Album table, based on a matching value of the aid variable:

sqlite> .width 25 50
sqlite> SELECT Artist.name, Album.title FROM Artist, Album WHERE Artist.aid = Album.aid;
Gloria Wood                Wood by the Fire                                  
Bobby Hackett              Jazz Session                                      
Bobby Hackett              1943-1947                                         
Bobby Hackett              Sextet                                            
Bobby Hackett              Blues With A Kick                                 
Bobby Hackett              Coast Concert                                     
Bobby Hackett              Creole Cookin'                                    
Bobby Hackett              In Concert                                        
George Wettling's Jazz Ba  Jazz Session                                      
Jimmy Forrest              most much!                                        
Jimmy Forrest              Night Train                                       
Jimmy Forrest              Sit Down And Relax With Jimmy Forrest             
Jimmy Forrest              Soul Street                                       
Jimmy Forrest              Out of the Forrest (1961)                         
Jimmy Forrest              Forrest Fire                                      
Jimmy Forrest              Black Forrest                                     
Mary Lou Williams          1927-1940                                         
Mary Lou Williams          Mary Lou Williams Presents Black Christ of the And
                          . . .

Queries like this are surprisingly fast - they are exactly what the database has been designed and optimized to acheive.

There are a few conveniences that you can use in SQL to save typing and improve the appearance of the output. You can use the AS keyword to provide an alternative name for a database for use within a query, or to rename a column so that it will display with a different heading. The previous query could have been written:

sqlite> SELECT ar.name AS Artist, al.title AS 'Album Name'
   ...>        FROM Artist AS ar, Album AS al
   ...>        WHERE ar.aid = al.aid LIMIT 200,10;
Artist                     Album Name                                        
-------------------------  --------------------------------------------------
Mundell Lowe               TransitWest                                       
Mundell Lowe               A Grand Night For Swinging                        
Mundell Lowe               The Mundell Lowe Quartet                          
Duke Ellington             Jazz Violin Session                               
Duke Ellington             Happy-Go-Lucky Local                              
Duke Ellington             The Great Paris Concert - Disc 2                  
Duke Ellington             The Great Paris Concert                           
Duke Ellington             Piano Reflections                                 
Duke Ellington             Continuum                                         
Duke Ellington             At Newport

Note the new headings that appear in the output. The LIMIT keyword allows you to display just some of the results. Given a single number, it prints that many lines starting from the beginning; given two numbers it starts at the first number and prints as many records as specified in the second number. Combining two data sets in this way is a very common operation, known as an inner join. An alternative way of expressing the query we're looking at would be as follows:

SELECT ar.name AS Artist, al.title AS 'Album Name'
         FROM Artist AS ar inner join  Album AS al
         using(aid) limit 200,10;

When the two tables being joined share exactly one common column (key), the operation is sometimes known as a natural join. Since this is the case with the Album and Artist tables, yet another way to express the query is

SELECT ar.name AS Artist, al.title AS 'Album Name'
         FROM Artist AS ar natural join  Album AS al limit 200,10;

Note that if the key used to join two tables does not have the same name in each of the tables, you must use the previous method that specifies the matching condition in the WHERE clause.

SQL provides some convenient facilities for aggregating values based on counts, sums, averages and minimums and maximums, through the COUNT, SUM, AVG, MIN and MAX functions. To find the number of records that match a certain condition, the expression COUNT(*) can be used. For example, to find the number of observations in the Track data set, we could use:

sqlite> SELECT COUNT(*) from Track;
COUNT(*)                 
-------------------------
54396

To count how many tracks were longer than 1000, we could use

sqlite> SELECT COUNT(*) from Track WHERE length > 1000;
COUNT(*)                 
-------------------------
129

To find the track with the longest length in the Track table, we could try:

sqlite> SELECT MAX(length) from Track;
MAX(length)              
-------------------------
2049

This gives us the length of the maximum track, but not its title. The easiest way to get around this is to use a subquery. In SQL, a subquery is a SELECT statement surrounded by parentheses. You can use the values returned by the subquery in WHERE or HAVING clauses. So to find the track with the maximum length, we could use

sqlite> SELECT title,length FROM Track WHERE length = (SELECT max(length) FROM Track);
title                      length                                            
-------------------------  --------------------------------------------------
Hues Of Melanin            2049

Of course, this query would be more useful if we knew the name of the album and artist that the track came from:

sqlite> .width 20 25 25 5
sqlite> SELECT tr.title,al.title,ar.name,tr.length
   ...>    FROM Track as tr, Album as al, Artist as ar 
   ...>    WHERE tr.alid = al.alid and tr.aid = al.aid and ar.aid = al.aid
   ...>    AND length = (SELECT max(length) FROM Track);
title                 title                      name                       lengt
--------------------  -------------------------  -------------------------  -----
Hues Of Melanin       Live                       Sam Rivers Trio            2049

When you construct queries like this, it's extremely important to list all of the WHERE conditions that need to be satisfied to avoid getting meaningless duplicate records.

As another example of a subquery, suppose we want to count how many albums we have for each artist, and then tabulate how many artists there are with one album, two albums, three albums, and so on. The first part is fairly easy:

sqlite> SELECT aid,count(aid) FROM Album GROUP BY aid limit 100;
aid                   count(aid)               
--------------------  -------------------------
1                     1                        
2                     1                        
3                     1                        
4                     1                        
5                     1                        
6                     1                        
       . . .

We can use this result as a subquery to see the distribution of albums per artist:

mysql> SELECT ct,count(ct) FROM 
    ->      (SELECT aid,count(aid) AS ct FROM Album GROUP BY aid)
    ->      GROUP by ct ORDER by ct;
ct                    count(ct)                
--------------------  -------------------------
1                     2303                     
2                     302                      
3                     137                      
4                     72                       
5                     46                       
6                     36                       
7                     25                       
8                     25                       
9                     15               
       . . .

When using subqueries in this way, some databases may require that the subquery is given an alias. In such a case, a query like the following could be used:

SELECT ct,count(ct) FROM
      (SELECT aid,count(aid) AS ct FROM Album GROUP BY aid) AS x
      GROUP by ct ORDER by ct;

One of the most useful features of SQL is that it can summarize large data sets, producing a smaller result set that can be more easily handled by other programs. For example, suppose we are interested in the total length (in seconds) of the tracks on each album. Instead of processing all the tracks outside of the database, we can create a summarized version of the database that has the lengths already summarized. Very few programs can operate on databases as fast as SQL, so operations like this can save lots of processing time. To perform operations like this we can use the GROUP BY clause:

sqlite> SELECT alid,SUM(length) AS TotalTime FROM Track GROUP BY alid limit 50;
alid                  TotalTime                
--------------------  -------------------------
1                     2551                     
2                     3710                     
3                     2402                     
4                     3339                     
5                     3588                     
6                     3207                     
7                     4268                     
8                     4409                     
9                     4120                     
10                    4249                    
         . . .

Once again, the results would probably be more useful with the title of the album and the artist included. In this example, we can also sort them by the total time of the album:

sqlite> .header on
sqlite> .mode column
sqlite> .width 20 25 25 5
sqlite> SELECT ar.name,al.title,SUM(tr.length) AS TotalTime
   ...>     FROM Track as tr, Artist as ar,Album as al
   ...>     WHERE ar.aid = al.aid AND al.alid = tr.alid
   ...>     GROUP BY tr.alid
   ...>     ORDER by TotalTime DESC limit 100;
ar.name               al.title                   TotalTime                
--------------------  -------------------------  -------------------------
Jimmy Giuffre/Marty   West Coast Scene           13856                    
Curtis Fuller         Complete Blue Note UA Ses  11529                    
Clifford Brown        At The Cotton Club (1956)  11282                    
Count Basie           American Decca Recordings  11261                    
Onzy Matthews         Mosaic Select: Onzy Matth  9624                     
Cannonball Adderley   Sophisticated Swing        9486                     
Thelonius Monk        Live at the It Club        9289                     
                  . . .

In this particular example, instead of processing the 53000 or so tracks outside of SQL, we've reduced what we have to work with to around 5000 records, which could make the overall processing much faster. Exactly which aggregation functions will be available depends on the database you're using.

File translated from T_EX by T_TH, version 3.67.
On 26 Mar 2011, 09:05.