Databases 
 
1  Databases
A database is a collection of data, usually with some information (sometimes called
metadata) about how the data is organized. But many times when people refer to a 
database, they mean a database server, which is similar to a web server, but responds
to requests for data instead of web pages. 
By far the most common type of database server is known a a relational database 
management system (RDBMS), and the most common way of communicating with such a database
is a language known as SQL, which is an acronym for Structured Query Language.  
Some examples of database systems that use SQL to communicate with an RDBMS include
Oracle, Sybase, Microsoft SQL Server, SQLite, MySQL and Postgres.  While there is an SQL 
standard, each database manufacturer provides some additional features, so SQL that
works with one database is not guaranteed to work on another.  We'll try to stick to
aspects of SQL that should be available on most SQL based systems.
A Database consists of one or more tables, which are generally stored as files on the 
computer that's running the DBMS.  A table is a rectangular array, where each row
represents an observation, and each column represents a variable.  Most databases 
consists of several tables, each containing information about one aspect of the data.
For example, consider a database to hold information about the parts needed to build
a variety of different products.  One way to store the information would be to have
a data table that had the part id, its description, the supplier's name, and the 
cost of the part.  An immediate problem with this is scheme concerns how we could store
information about the relation between products and parts.  Furthermore, if one supplier 
provided many different parts, redundant information 
about suppliers would be repeated many times in the data table.  In the 1970s when 
database technology was first being developed, disk and memory space were limited and 
expensive, and organizing data in this way was not very efficient, especially as the 
size of the data base increased.  In the late 1970s, the idea of a relational database
was developed by IBM, with the first commercial offering coming from a company which is
now known as Oracle.  Relational database design is governed by a principle known as 
normalization.  While  entire books are devoted to the subject, the basic idea of normalization
is to try and remove redundancy as much as possible when creating the tables that make up 
a data base.  Continuing the parts example, a properly normalized database to hold the 
parts information would consist of four tables.  The first would contain a part id to uniquely
identify the part, its description, price and an id to identify the supplier.  A second table
would contain the supplier codes and any other information about the supplier.  A third table
would contain product ids and descriptions, while the final table would have one record for 
each part used in each product, stored as pairs of product id and part id.  
The variables that link together the different databases are refered to as keys, or sometimes 
foreign keys.  Clearly, making sure that there are keys to link information from one table to
another is critical to the idea of a normalized data base.
This allows large
amounts of information to be stored in manageably sized tables which can be modified and updated
without having to change all of the information in all of the tables.  Such tables will be 
efficient in the amount of disk and memory resources that they need, which was critical at the 
time such databases were developed.  Also critical to this scheme is a fast and efficient way of 
joining together these tables so that queries like "Which suppliers are used for product xyz?"
or "What parts from Acme Machine Works cost between $2 and $4" or "What is the total cost of 
the parts needed to make product xyz?".  In fact, for many years the only programs that were 
capable of combining data from multiple sources were RDBMSs.  
We're going to look at the way the SQL language is used to extract information from RDBMSs,
specifically the open source SQLite data base (http://sqlite.org) 
(which doesn't require a database server), and
the open source MySQL data base (http://www.mysql.com) which does.
In addition we'll see how to do typical database operations in R, 
and how to access a database using SQL from inside of R.
One of the first questions that arises when thinking about databases is "When should I 
use a database?"  There are several cases where a database makes sense:
- The only source for the data you need may be an existing database.
 
- 
The data you're working with changes on a regular basis, especially if it can
potentially be changed by many people.  Database servers (like most other servers)
provide concurrency, which takes care of problems that might otherwise arise if 
more than one person tries to access or modify a database at the same time.
 
- 
Your data is simply too big to fit into the computer's memory, but if you could 
conveniently extract just the parts you need, you'd be able to work with the data.
For example, you may have hundreds of variables, but only need to work with a small
number at at time, or you have tens of thousands of observations, but only need to work
with a few thousand at a time.
 
One of the most important uses of databases in the last ten years has been as back
ends to dynamic web sites.  If you've ever placed an order online, joined an online
discussion community, or gotten a username/password combination to access web 
resources, you've most likely interacted with a database.  In addition both the 
Firefox and Safari browsers us 
SQLite databases to store cookies, certificates, and other information.
There are actually three parts to a RDBMS system: data definition, data access, and
privelege management.  We're only going to look at the data access aspect of databases;
we'll assume that a database is already available, and that the administrator of the 
database has provided you with access to the resources you need.   You can communicate 
with a database in at least three different ways:
- You can use a command line program that will display its results in a way similar to 
the way R works: you type in a query, and the results are displayed.
 
- 
You can use a graphical interface (GUI) that will display the results and give you the 
option to export them in different forms, such as a comma-separated data file.  One 
advantage to these clients is that they usually provide a display of the available 
databases, along with the names of the tables in the database and the names of the columns
within the tables.
 
- 
You can use a library within R to query the database and return the results of the query
into an R data frame.  For use with MySQL library, the RMySQL library is available for
download from CRAN; for SQLite, the RSQLite package is available.  When you install either, 
the DBI
library, which provides a consistent interface to different databases, will also be installed.
 
It's very important to understand that SQL is not a programming language - it's said
to be a declarative language.  Instead of solving problems by writing a series of 
instructions or putting together programming statements into functions, with SQL you
write a single statement (query) that will return some or all of the records from a 
database.  SQL was designed to be easy to use for non-programmers, allowing them to 
present queries to the database in a language that is more similar to a spoken language
than a programming language.  The main tool for data access is the select statement.
Since everything needs to be done with this single statement, descriptions of its syntax
can be quite daunting.  For example, here's an example that shows some of the important features 
of the select statement:
SELECT columns or computations
       FROM table
       WHERE condition
       GROUP BY columns
       HAVING condition
       ORDER BY column [ASC | DESC]
       LIMIT offset,count;
(The keywords in the above statement can be typed in upper or lower case, but I'll generally
use upper case to remind us that they are keywords.)  MySQL is case sensistive to table and
column names, but not for commands. SQLite is not case sensitive at all.
SQL statements always end with a semi-colon,
but graphical clients and interfaces in other languages sometimes don't require the semi-colon.
Fortunately, you rarely need to use all of these commands in a single query.  In fact, 
you may be able to get by knowing just the command that returns all of the information
from a table in a database:
       SELECT * from table;
 
2  Working with Databases
When you're not using a graphical interface, it's easy to get lost in a database considering
that each database can hold multiple tables, and each table can have multiple columns.  Three
commands that are useful in finding your way around a database are:
| MySQL | SQLite | 
| SHOW DATABASES; | .databases | 
| SHOW TABLES IN database; | .tables | 
| SHOW COLUMNS IN table; | pragma table_info(table) | 
| DESCRIBE table; | .schema table | 
For use as an example, consider a database to keep track of a music collection, stored as
an SQLite database in the file albums.db.  We can use the commands just shown to see how the
database is organized.  In class, I sometimes used some of the graphical SQLite clients that are
mentioned on the Class Resources page. Here, I'm using the UNIX command-line client 
called sqlite3:
springer.spector$ sqlite3 albums.db 
SQLite version 3.4.2
Enter ".help" for instructions
sqlite> .mode column
sqlite> .header on
sqlite> .tables
Album   Artist  Track 
	title TEXT 
);
sqlite> .schema Album
CREATE TABLE Album 
( aid INTEGER,
	alid INTEGER,
	title TEXT 
);
CREATE INDEX jj on Album(alid);
sqlite> .schema Artist
CREATE TABLE Artist 
( aid INTEGER,
	name TEXT 
);
CREATE INDEX ii on Artist(aid);
Studying these tables we can see that the organization is as follows:  each track 
in the Track table 
is identified by an artist id (aid) and an album id (alid).  To
get the actual names of the artists, we can link to the Artist database
through the aid key;  to get the names of the albums, we can link to the 
Album table through the alid key.  These keys are integers, which
don't take up very much space; the actual text of the album title or artist's name 
is stored only once in the table devoted to that purpose.  (This is the basic 
principle of normalization that drives database design.)  The first thing to notice 
in schemes like this is that the individual tables won't allow us to really see 
anything iteresting, since they just contain keys, not actual values; to get interesting
information, we need to join together multiple tables.  For example, let's say we want
a list of album titles and artist names.  We want to extract the Name from the 
Artist table and combine it with Title from the Album table,
based on a matching value of the aid variable:
sqlite> .width 25 50
sqlite> SELECT Artist.name, Album.title FROM Artist, Album WHERE Artist.aid = Album.aid;
Gloria Wood                Wood by the Fire                                  
Bobby Hackett              Jazz Session                                      
Bobby Hackett              1943-1947                                         
Bobby Hackett              Sextet                                            
Bobby Hackett              Blues With A Kick                                 
Bobby Hackett              Coast Concert                                     
Bobby Hackett              Creole Cookin'                                    
Bobby Hackett              In Concert                                        
George Wettling's Jazz Ba  Jazz Session                                      
Jimmy Forrest              most much!                                        
Jimmy Forrest              Night Train                                       
Jimmy Forrest              Sit Down And Relax With Jimmy Forrest             
Jimmy Forrest              Soul Street                                       
Jimmy Forrest              Out of the Forrest (1961)                         
Jimmy Forrest              Forrest Fire                                      
Jimmy Forrest              Black Forrest                                     
Mary Lou Williams          1927-1940                                         
Mary Lou Williams          Mary Lou Williams Presents Black Christ of the And
                          . . .
Queries like this are surprisingly fast - they are exactly what the database has been designed
and optimized to acheive.
There are a few conveniences that you can use in SQL to save typing and improve the appearance
of the output.  You can use the AS keyword to provide an alternative name for a database
for use within a query, or to rename a column so that it will display with a different heading.
The previous query could have been written:
sqlite> SELECT ar.name AS Artist, al.title AS 'Album Name'
   ...>        FROM Artist AS ar, Album AS al
   ...>        WHERE ar.aid = al.aid LIMIT 200,10;
Artist                     Album Name                                        
-------------------------  --------------------------------------------------
Mundell Lowe               TransitWest                                       
Mundell Lowe               A Grand Night For Swinging                        
Mundell Lowe               The Mundell Lowe Quartet                          
Duke Ellington             Jazz Violin Session                               
Duke Ellington             Happy-Go-Lucky Local                              
Duke Ellington             The Great Paris Concert - Disc 2                  
Duke Ellington             The Great Paris Concert                           
Duke Ellington             Piano Reflections                                 
Duke Ellington             Continuum                                         
Duke Ellington             At Newport                                   
 
Note the new headings that appear in the output.  
The LIMIT keyword allows
you to display just some of the results.  Given a single number, it prints that many
lines starting from the beginning; given two numbers it starts at the first number and
prints as many records as specified in the second number.  Combining two data sets in
this way is a very common operation, known as an inner join.  An alternative way of 
expressing the query we're looking at would be as follows:
SELECT ar.name AS Artist, al.title AS 'Album Name'
         FROM Artist AS ar inner join  Album AS al
         using(aid) limit 200,10;
When the two tables being joined share exactly one common column (key), the operation
is sometimes known as a natural join.  Since this is the case with the Album
and Artist tables, yet another way to express the query is
SELECT ar.name AS Artist, al.title AS 'Album Name'
         FROM Artist AS ar natural join  Album AS al limit 200,10;
Note that if the key used to join two tables does not have the same name in each of the
tables, you must use the previous method that specifies the matching condition in the 
WHERE clause.
SQL provides some convenient facilities for aggregating values based on counts, sums,
averages and minimums and maximums, through the COUNT, SUM, AVG,
MIN and MAX functions.  
To find the number of records that match a certain condition, the expression
COUNT(*) can be used.  For example, to find the number of observations in 
the Track data set, we could use:
sqlite> SELECT COUNT(*) from Track;
COUNT(*)                 
-------------------------
54396                  
To count how many tracks were longer than 1000, we could use
sqlite> SELECT COUNT(*) from Track WHERE length > 1000;
COUNT(*)                 
-------------------------
129                      
To find the track with the longest 
length in the Track table, we could try:
sqlite> SELECT MAX(length) from Track;
MAX(length)              
-------------------------
2049
This gives us the length of the maximum track, but not its title.  The easiest way to 
get around this is to use a subquery.  In SQL, a subquery is a SELECT statement 
surrounded by parentheses.  You can use the values returned by the subquery in WHERE
or HAVING clauses.  So to find the track with the maximum length, we could use
sqlite> SELECT title,length FROM Track WHERE length = (SELECT max(length) FROM Track);
title                      length                                            
-------------------------  --------------------------------------------------
Hues Of Melanin            2049                 
Of course, this query would be more useful if we knew the name of the album and artist that
the track came from:
sqlite> .width 20 25 25 5
sqlite> SELECT tr.title,al.title,ar.name,tr.length
   ...>    FROM Track as tr, Album as al, Artist as ar 
   ...>    WHERE tr.alid = al.alid and tr.aid = al.aid and ar.aid = al.aid
   ...>    AND length = (SELECT max(length) FROM Track);
title                 title                      name                       lengt
--------------------  -------------------------  -------------------------  -----
Hues Of Melanin       Live                       Sam Rivers Trio            2049 
When you construct queries like this, it's extremely important to list all of 
the WHERE conditions that need to be satisfied to avoid getting meaningless
duplicate records.  
As another example of a subquery, suppose we want to count how many albums we have 
for each artist, and then tabulate how many artists there are with one album, two
albums, three albums, and so on.  The first part is fairly easy:
sqlite> SELECT aid,count(aid) FROM Album GROUP BY aid limit 100;
aid                   count(aid)               
--------------------  -------------------------
1                     1                        
2                     1                        
3                     1                        
4                     1                        
5                     1                        
6                     1                        
       . . .
We can use this result as a subquery to see the distribution of albums per artist:
mysql> SELECT ct,count(ct) FROM 
    ->      (SELECT aid,count(aid) AS ct FROM Album GROUP BY aid)
    ->      GROUP by ct ORDER by ct;
ct                    count(ct)                
--------------------  -------------------------
1                     2303                     
2                     302                      
3                     137                      
4                     72                       
5                     46                       
6                     36                       
7                     25                       
8                     25                       
9                     15               
       . . .
When using subqueries in this way, some databases may require that the 
subquery is given an alias.  In such a case, a query like the following 
could be used:
SELECT ct,count(ct) FROM
      (SELECT aid,count(aid) AS ct FROM Album GROUP BY aid) AS x
      GROUP by ct ORDER by ct;
One of the most useful features of SQL is that it can summarize large data 
sets, producing a smaller result set that can be more easily handled by other programs.
For example, suppose we are interested in the total length (in seconds) of the tracks on
each album.  Instead of processing all the tracks outside of the database, we can create
a summarized version of the database that has the lengths already summarized.  Very few 
programs can operate on databases as fast as SQL, so operations like this can save lots of
processing time.  To perform operations like this we can use the GROUP BY clause:
sqlite> SELECT alid,SUM(length) AS TotalTime FROM Track GROUP BY alid limit 50;
alid                  TotalTime                
--------------------  -------------------------
1                     2551                     
2                     3710                     
3                     2402                     
4                     3339                     
5                     3588                     
6                     3207                     
7                     4268                     
8                     4409                     
9                     4120                     
10                    4249                    
         . . .
Once again, the results would probably be more useful with the title of the album and the 
artist included.  In this example, we can also sort them by the total time of the album:
sqlite> .header on
sqlite> .mode column
sqlite> .width 20 25 25 5
sqlite> SELECT ar.name,al.title,SUM(tr.length) AS TotalTime
   ...>     FROM Track as tr, Artist as ar,Album as al
   ...>     WHERE ar.aid = al.aid AND al.alid = tr.alid
   ...>     GROUP BY tr.alid
   ...>     ORDER by TotalTime DESC limit 100;
ar.name               al.title                   TotalTime                
--------------------  -------------------------  -------------------------
Jimmy Giuffre/Marty   West Coast Scene           13856                    
Curtis Fuller         Complete Blue Note UA Ses  11529                    
Clifford Brown        At The Cotton Club (1956)  11282                    
Count Basie           American Decca Recordings  11261                    
Onzy Matthews         Mosaic Select: Onzy Matth  9624                     
Cannonball Adderley   Sophisticated Swing        9486                     
Thelonius Monk        Live at the It Club        9289                     
                  . . . 
 
In this particular example, instead of processing the 53000 or so 
tracks outside of 
SQL, we've reduced what we have to work with to around 5000 records, which could make
the overall processing much faster.   Exactly which aggregation functions will be
available depends on the database you're using.  
File translated from
TEX
by 
TTH,
version 3.67.
On 26 Mar 2011, 09:05.