Databases
1 Databases
A database is a collection of data, usually with some information (sometimes called
metadata) about how the data is organized. But many times when people refer to a
database, they mean a database server, which is similar to a web server, but responds
to requests for data instead of web pages.
By far the most common type of database server is known a a relational database
management system (RDBMS), and the most common way of communicating with such a database
is a language known as SQL, which is an acronym for Structured Query Language.
Some examples of database systems that use SQL to communicate with an RDBMS include
Oracle, Sybase, Microsoft SQL Server, SQLite, MySQL and Postgres. While there is an SQL
standard, each database manufacturer provides some additional features, so SQL that
works with one database is not guaranteed to work on another. We'll try to stick to
aspects of SQL that should be available on most SQL based systems.
A Database consists of one or more tables, which are generally stored as files on the
computer that's running the DBMS. A table is a rectangular array, where each row
represents an observation, and each column represents a variable. Most databases
consists of several tables, each containing information about one aspect of the data.
For example, consider a database to hold information about the parts needed to build
a variety of different products. One way to store the information would be to have
a data table that had the part id, its description, the supplier's name, and the
cost of the part. An immediate problem with this is scheme concerns how we could store
information about the relation between products and parts. Furthermore, if one supplier
provided many different parts, redundant information
about suppliers would be repeated many times in the data table. In the 1970s when
database technology was first being developed, disk and memory space were limited and
expensive, and organizing data in this way was not very efficient, especially as the
size of the data base increased. In the late 1970s, the idea of a relational database
was developed by IBM, with the first commercial offering coming from a company which is
now known as Oracle. Relational database design is governed by a principle known as
normalization. While entire books are devoted to the subject, the basic idea of normalization
is to try and remove redundancy as much as possible when creating the tables that make up
a data base. Continuing the parts example, a properly normalized database to hold the
parts information would consist of four tables. The first would contain a part id to uniquely
identify the part, its description, price and an id to identify the supplier. A second table
would contain the supplier codes and any other information about the supplier. A third table
would contain product ids and descriptions, while the final table would have one record for
each part used in each product, stored as pairs of product id and part id.
The variables that link together the different databases are refered to as keys, or sometimes
foreign keys. Clearly, making sure that there are keys to link information from one table to
another is critical to the idea of a normalized data base.
This allows large
amounts of information to be stored in manageably sized tables which can be modified and updated
without having to change all of the information in all of the tables. Such tables will be
efficient in the amount of disk and memory resources that they need, which was critical at the
time such databases were developed. Also critical to this scheme is a fast and efficient way of
joining together these tables so that queries like "Which suppliers are used for product xyz?"
or "What parts from Acme Machine Works cost between $2 and $4" or "What is the total cost of
the parts needed to make product xyz?". In fact, for many years the only programs that were
capable of combining data from multiple sources were RDBMSs.
We're going to look at the way the SQL language is used to extract information from RDBMSs,
specifically the open source SQLite data base (http://sqlite.org)
(which doesn't require a database server), and
the open source MySQL data base (http://www.mysql.com) which does.
In addition we'll see how to do typical database operations in R,
and how to access a database using SQL from inside of R.
One of the first questions that arises when thinking about databases is "When should I
use a database?" There are several cases where a database makes sense:
- The only source for the data you need may be an existing database.
-
The data you're working with changes on a regular basis, especially if it can
potentially be changed by many people. Database servers (like most other servers)
provide concurrency, which takes care of problems that might otherwise arise if
more than one person tries to access or modify a database at the same time.
-
Your data is simply too big to fit into the computer's memory, but if you could
conveniently extract just the parts you need, you'd be able to work with the data.
For example, you may have hundreds of variables, but only need to work with a small
number at at time, or you have tens of thousands of observations, but only need to work
with a few thousand at a time.
One of the most important uses of databases in the last ten years has been as back
ends to dynamic web sites. If you've ever placed an order online, joined an online
discussion community, or gotten a username/password combination to access web
resources, you've most likely interacted with a database. In addition both the
Firefox and Safari browsers us
SQLite databases to store cookies, certificates, and other information.
There are actually three parts to a RDBMS system: data definition, data access, and
privelege management. We're only going to look at the data access aspect of databases;
we'll assume that a database is already available, and that the administrator of the
database has provided you with access to the resources you need. You can communicate
with a database in at least three different ways:
- You can use a command line program that will display its results in a way similar to
the way R works: you type in a query, and the results are displayed.
-
You can use a graphical interface (GUI) that will display the results and give you the
option to export them in different forms, such as a comma-separated data file. One
advantage to these clients is that they usually provide a display of the available
databases, along with the names of the tables in the database and the names of the columns
within the tables.
-
You can use a library within R to query the database and return the results of the query
into an R data frame. For use with MySQL library, the RMySQL library is available for
download from CRAN; for SQLite, the RSQLite package is available. When you install either,
the DBI
library, which provides a consistent interface to different databases, will also be installed.
It's very important to understand that SQL is not a programming language - it's said
to be a declarative language. Instead of solving problems by writing a series of
instructions or putting together programming statements into functions, with SQL you
write a single statement (query) that will return some or all of the records from a
database. SQL was designed to be easy to use for non-programmers, allowing them to
present queries to the database in a language that is more similar to a spoken language
than a programming language. The main tool for data access is the select statement.
Since everything needs to be done with this single statement, descriptions of its syntax
can be quite daunting. For example, here's an example that shows some of the important features
of the select statement:
SELECT columns or computations
FROM table
WHERE condition
GROUP BY columns
HAVING condition
ORDER BY column [ASC | DESC]
LIMIT offset,count;
(The keywords in the above statement can be typed in upper or lower case, but I'll generally
use upper case to remind us that they are keywords.) MySQL is case sensistive to table and
column names, but not for commands. SQLite is not case sensitive at all.
SQL statements always end with a semi-colon,
but graphical clients and interfaces in other languages sometimes don't require the semi-colon.
Fortunately, you rarely need to use all of these commands in a single query. In fact,
you may be able to get by knowing just the command that returns all of the information
from a table in a database:
SELECT * from table;
2 Working with Databases
When you're not using a graphical interface, it's easy to get lost in a database considering
that each database can hold multiple tables, and each table can have multiple columns. Three
commands that are useful in finding your way around a database are:
MySQL | SQLite |
SHOW DATABASES; | .databases |
SHOW TABLES IN database; | .tables |
SHOW COLUMNS IN table; | pragma table_info(table) |
DESCRIBE table; | .schema table |
For use as an example, consider a database to keep track of a music collection, stored as
an SQLite database in the file albums.db. We can use the commands just shown to see how the
database is organized. In class, I sometimes used some of the graphical SQLite clients that are
mentioned on the Class Resources page. Here, I'm using the UNIX command-line client
called sqlite3:
springer.spector$ sqlite3 albums.db
SQLite version 3.4.2
Enter ".help" for instructions
sqlite> .mode column
sqlite> .header on
sqlite> .tables
Album Artist Track
title TEXT
);
sqlite> .schema Album
CREATE TABLE Album
( aid INTEGER,
alid INTEGER,
title TEXT
);
CREATE INDEX jj on Album(alid);
sqlite> .schema Artist
CREATE TABLE Artist
( aid INTEGER,
name TEXT
);
CREATE INDEX ii on Artist(aid);
Studying these tables we can see that the organization is as follows: each track
in the Track table
is identified by an artist id (aid) and an album id (alid). To
get the actual names of the artists, we can link to the Artist database
through the aid key; to get the names of the albums, we can link to the
Album table through the alid key. These keys are integers, which
don't take up very much space; the actual text of the album title or artist's name
is stored only once in the table devoted to that purpose. (This is the basic
principle of normalization that drives database design.) The first thing to notice
in schemes like this is that the individual tables won't allow us to really see
anything iteresting, since they just contain keys, not actual values; to get interesting
information, we need to join together multiple tables. For example, let's say we want
a list of album titles and artist names. We want to extract the Name from the
Artist table and combine it with Title from the Album table,
based on a matching value of the aid variable:
sqlite> .width 25 50
sqlite> SELECT Artist.name, Album.title FROM Artist, Album WHERE Artist.aid = Album.aid;
Gloria Wood Wood by the Fire
Bobby Hackett Jazz Session
Bobby Hackett 1943-1947
Bobby Hackett Sextet
Bobby Hackett Blues With A Kick
Bobby Hackett Coast Concert
Bobby Hackett Creole Cookin'
Bobby Hackett In Concert
George Wettling's Jazz Ba Jazz Session
Jimmy Forrest most much!
Jimmy Forrest Night Train
Jimmy Forrest Sit Down And Relax With Jimmy Forrest
Jimmy Forrest Soul Street
Jimmy Forrest Out of the Forrest (1961)
Jimmy Forrest Forrest Fire
Jimmy Forrest Black Forrest
Mary Lou Williams 1927-1940
Mary Lou Williams Mary Lou Williams Presents Black Christ of the And
. . .
Queries like this are surprisingly fast - they are exactly what the database has been designed
and optimized to acheive.
There are a few conveniences that you can use in SQL to save typing and improve the appearance
of the output. You can use the AS keyword to provide an alternative name for a database
for use within a query, or to rename a column so that it will display with a different heading.
The previous query could have been written:
sqlite> SELECT ar.name AS Artist, al.title AS 'Album Name'
...> FROM Artist AS ar, Album AS al
...> WHERE ar.aid = al.aid LIMIT 200,10;
Artist Album Name
------------------------- --------------------------------------------------
Mundell Lowe TransitWest
Mundell Lowe A Grand Night For Swinging
Mundell Lowe The Mundell Lowe Quartet
Duke Ellington Jazz Violin Session
Duke Ellington Happy-Go-Lucky Local
Duke Ellington The Great Paris Concert - Disc 2
Duke Ellington The Great Paris Concert
Duke Ellington Piano Reflections
Duke Ellington Continuum
Duke Ellington At Newport
Note the new headings that appear in the output.
The LIMIT keyword allows
you to display just some of the results. Given a single number, it prints that many
lines starting from the beginning; given two numbers it starts at the first number and
prints as many records as specified in the second number. Combining two data sets in
this way is a very common operation, known as an inner join. An alternative way of
expressing the query we're looking at would be as follows:
SELECT ar.name AS Artist, al.title AS 'Album Name'
FROM Artist AS ar inner join Album AS al
using(aid) limit 200,10;
When the two tables being joined share exactly one common column (key), the operation
is sometimes known as a natural join. Since this is the case with the Album
and Artist tables, yet another way to express the query is
SELECT ar.name AS Artist, al.title AS 'Album Name'
FROM Artist AS ar natural join Album AS al limit 200,10;
Note that if the key used to join two tables does not have the same name in each of the
tables, you must use the previous method that specifies the matching condition in the
WHERE clause.
SQL provides some convenient facilities for aggregating values based on counts, sums,
averages and minimums and maximums, through the COUNT, SUM, AVG,
MIN and MAX functions.
To find the number of records that match a certain condition, the expression
COUNT(*) can be used. For example, to find the number of observations in
the Track data set, we could use:
sqlite> SELECT COUNT(*) from Track;
COUNT(*)
-------------------------
54396
To count how many tracks were longer than 1000, we could use
sqlite> SELECT COUNT(*) from Track WHERE length > 1000;
COUNT(*)
-------------------------
129
To find the track with the longest
length in the Track table, we could try:
sqlite> SELECT MAX(length) from Track;
MAX(length)
-------------------------
2049
This gives us the length of the maximum track, but not its title. The easiest way to
get around this is to use a subquery. In SQL, a subquery is a SELECT statement
surrounded by parentheses. You can use the values returned by the subquery in WHERE
or HAVING clauses. So to find the track with the maximum length, we could use
sqlite> SELECT title,length FROM Track WHERE length = (SELECT max(length) FROM Track);
title length
------------------------- --------------------------------------------------
Hues Of Melanin 2049
Of course, this query would be more useful if we knew the name of the album and artist that
the track came from:
sqlite> .width 20 25 25 5
sqlite> SELECT tr.title,al.title,ar.name,tr.length
...> FROM Track as tr, Album as al, Artist as ar
...> WHERE tr.alid = al.alid and tr.aid = al.aid and ar.aid = al.aid
...> AND length = (SELECT max(length) FROM Track);
title title name lengt
-------------------- ------------------------- ------------------------- -----
Hues Of Melanin Live Sam Rivers Trio 2049
When you construct queries like this, it's extremely important to list all of
the WHERE conditions that need to be satisfied to avoid getting meaningless
duplicate records.
As another example of a subquery, suppose we want to count how many albums we have
for each artist, and then tabulate how many artists there are with one album, two
albums, three albums, and so on. The first part is fairly easy:
sqlite> SELECT aid,count(aid) FROM Album GROUP BY aid limit 100;
aid count(aid)
-------------------- -------------------------
1 1
2 1
3 1
4 1
5 1
6 1
. . .
We can use this result as a subquery to see the distribution of albums per artist:
mysql> SELECT ct,count(ct) FROM
-> (SELECT aid,count(aid) AS ct FROM Album GROUP BY aid)
-> GROUP by ct ORDER by ct;
ct count(ct)
-------------------- -------------------------
1 2303
2 302
3 137
4 72
5 46
6 36
7 25
8 25
9 15
. . .
When using subqueries in this way, some databases may require that the
subquery is given an alias. In such a case, a query like the following
could be used:
SELECT ct,count(ct) FROM
(SELECT aid,count(aid) AS ct FROM Album GROUP BY aid) AS x
GROUP by ct ORDER by ct;
One of the most useful features of SQL is that it can summarize large data
sets, producing a smaller result set that can be more easily handled by other programs.
For example, suppose we are interested in the total length (in seconds) of the tracks on
each album. Instead of processing all the tracks outside of the database, we can create
a summarized version of the database that has the lengths already summarized. Very few
programs can operate on databases as fast as SQL, so operations like this can save lots of
processing time. To perform operations like this we can use the GROUP BY clause:
sqlite> SELECT alid,SUM(length) AS TotalTime FROM Track GROUP BY alid limit 50;
alid TotalTime
-------------------- -------------------------
1 2551
2 3710
3 2402
4 3339
5 3588
6 3207
7 4268
8 4409
9 4120
10 4249
. . .
Once again, the results would probably be more useful with the title of the album and the
artist included. In this example, we can also sort them by the total time of the album:
sqlite> .header on
sqlite> .mode column
sqlite> .width 20 25 25 5
sqlite> SELECT ar.name,al.title,SUM(tr.length) AS TotalTime
...> FROM Track as tr, Artist as ar,Album as al
...> WHERE ar.aid = al.aid AND al.alid = tr.alid
...> GROUP BY tr.alid
...> ORDER by TotalTime DESC limit 100;
ar.name al.title TotalTime
-------------------- ------------------------- -------------------------
Jimmy Giuffre/Marty West Coast Scene 13856
Curtis Fuller Complete Blue Note UA Ses 11529
Clifford Brown At The Cotton Club (1956) 11282
Count Basie American Decca Recordings 11261
Onzy Matthews Mosaic Select: Onzy Matth 9624
Cannonball Adderley Sophisticated Swing 9486
Thelonius Monk Live at the It Club 9289
. . .
In this particular example, instead of processing the 53000 or so
tracks outside of
SQL, we've reduced what we have to work with to around 5000 records, which could make
the overall processing much faster. Exactly which aggregation functions will be
available depends on the database you're using.
File translated from
TEX
by
TTH,
version 3.67.
On 26 Mar 2011, 09:05.