XML
1 A Simple Example
Let's start with a very simple example to introduce the XML library and some of
the strategies we'll need to use when working with XML. The example we'll use is
a catalog of plants. (You can find the XML document at
http://www.stat.berkeley.edu/~spector/s133/data/plant_catalog.xml.)
Here's what the beginning of the file looks like:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Edited with XML Spy v2006 (http://www.altova.com) -->
<CATALOG>
<PLANT>
<COMMON>Bloodroot</COMMON>
<BOTANICAL>Sanguinaria canadensis</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$2.44</PRICE>
<AVAILABILITY>031599</AVAILABILITY>
</PLANT>
<PLANT>
<COMMON>Columbine</COMMON>
<BOTANICAL>Aquilegia canadensis</BOTANICAL>
<ZONE>3</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$9.37</PRICE>
<AVAILABILITY>030699</AVAILABILITY>
</PLANT>
. . .
The main body of the data is enclosed by the CATALOG tag; we'll sometimes
refer to this as a node.
Within the CATALOG node are a number of PLANT nodes, each of which
contains the pieces of data we wish to extract.
We're going to use the R XML library with what is known as the Document Object Model
(DOM), in which R will read the entire XML file into memory, and internally convert it to a
tree structure. (There is another model, called SAX (Simple API for XML), which only
reads part of the data at a time, but it's more complicated than the DOM model.)
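To give a flavor of the SAX approach, here is a minimal sketch using the XML package's xmlEventParse function; the handler and counter logic are our own, not part of this example, and simply count PLANT nodes as the parser streams through the file:

```r
library(XML)

# SAX-style parsing: handler functions are called as the parser moves
# through the document, so the full tree is never held in memory
counter = 0
startElement = function(name, attrs, ...){
    if (name == 'PLANT') counter <<- counter + 1
}
invisible(xmlEventParse('plant_catalog.xml',
                        handlers = list(startElement = startElement)))
counter    # the number of PLANT nodes encountered
```

For a document this small the DOM approach is simpler; SAX becomes attractive when the file is too large to hold in memory at once.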
All the information about the data and its structure will be stored
in a list inside of R, but when we print it, it will look very much like the document
that we started with. The basic strategy of working with XML files under this model
is to keep "tunneling down" into the document structure without disturbing its
structure until we come to the end of the node. For example, with the plant catalog,
we want to extract the PLANT nodes as XML structures, and then extract the
values of the terminal branches of the tree (like COMMON, BOTANICAL,
and ZONE) using the xmlValue function. As we explore the structure
of the data, we can use the names function to see what nodes exist; at any
time, we can print a node, and it will show us a representation like the original XML
document so we'll know where to go next.
The first step in reading an XML document into R is loading the XML library.
Next, the xmlTreeParse function is called to read the document into memory,
and the root of the XML tree is extracted using xmlRoot:
> library(XML)
> doc = xmlTreeParse('plant_catalog.xml')
> root = xmlRoot(doc)
Let's look at some properties of this root node:
> class(root)
[1] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass"
> names(root)
[1] "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT"
[10] "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT"
[19] "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT"
[28] "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT"
In this example, xmlTreeParse read its input from a local file.
Other choices are local gzipped files and URLs representing XML
documents or gzipped XML documents.
The class of XMLNode indicates that internally, R is storing the object
as part of an XML tree. This will be important later when we decide how to
process the entire tree. The result of the names function call tells us
that the structure of this document is quite simple. Inside the root node, there
are 36 PLANT nodes. (For larger documents, it might be prudent to make
a table of the names instead of displaying the names directly.) As we've already seen
from looking at the document,
this is where our data lies, so we'll want to examine one of the plant nodes a
little more carefully to figure out how to extract the actual data.
> oneplant = root[[1]]
> class(oneplant)
[1] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass"
> oneplant
<PLANT>
<COMMON>
Bloodroot
</COMMON>
<BOTANICAL>
Sanguinaria canadensis
</BOTANICAL>
<ZONE>
4
</ZONE>
<LIGHT>
Mostly Shady
</LIGHT>
<PRICE>
$2.44
</PRICE>
<AVAILABILITY>
031599
</AVAILABILITY>
</PLANT>
We can see that this single PLANT object is still an XMLNode; its
printed representation shows us exactly what we've got. Notice that the individual
elements don't have any further tree structure; this means we can use xmlValue
to extract the values:
> xmlValue(oneplant[['COMMON']])
[1] "Bloodroot"
> xmlValue(oneplant[['BOTANICAL']])
[1] "Sanguinaria canadensis"
Of course, we don't want to perform these tasks one by one for each node of the
tree. You may recall that when we were confronted with problems like this before,
we used the sapply function, which operates on every element of a list.
If the object we want to process is an XMLNode, the corresponding function to
use is xmlSApply. For example, to get all the common names of all the plants,
we can use xmlValue on all the PLANT nodes like this:
> commons = xmlSApply(root,function(x)xmlValue(x[['COMMON']]))
> head(commons)
PLANT PLANT PLANT
"Bloodroot" "Columbine" "Marsh Marigold"
PLANT PLANT PLANT
"Cowslip" "Dutchman's-Breeches" "Ginger, Wild"
We could repeat the process for each column manually, and then combine things into
a data frame, or we can automate the process using lapply and the names
of the objects within the plants:
> getvar = function(x,var)xmlValue(x[[var]])
> res = lapply(names(root[[1]]),function(var)xmlSApply(root,getvar,var))
> plants = data.frame(res)
> names(plants) = names(root[[1]])
> head(plants)
COMMON BOTANICAL ZONE LIGHT PRICE
1 Bloodroot Sanguinaria canadensis 4 Mostly Shady $2.44
2 Columbine Aquilegia canadensis 3 Mostly Shady $9.37
3 Marsh Marigold Caltha palustris 4 Mostly Sunny $6.81
4 Cowslip Caltha palustris 4 Mostly Shady $9.90
5 Dutchman's-Breeches Dicentra cucullaria 3 Mostly Shady $6.44
6 Ginger, Wild Asarum canadense 3 Mostly Shady $9.03
AVAILABILITY
1 031599
2 030699
3 051799
4 030699
5 012099
6 041899
2 A More Complex Example
For more complex documents, a few other tools are useful. To illustrate, we'll look
at a file that uses Geographic Markup Language, or GML.
This file (which you can find at
http://www.stat.berkeley.edu/~spector/s133/data/counties.gml)
contains the x- and y-coordinates of the county centers for each state in the
United States. Information like this would be difficult to store in a less structured
environment, because each state has a different number of counties. If we were going
to read it into a database, we might want a separate table for each state; in
other programs, we'd have to force the data into a rectangular form, which
wouldn't be very efficient. If we were using a spreadsheet, we might put all the information
in a single spreadsheet, with a separate heading for each state.
In R, a reasonable way to store the data would be in a list, with a separate data frame
for each state.
Thus, providing the
data in a form that could be easily converted to any of those formats is essential, and
that's just what XML does for us.
The first steps are the same as in the previous example: loading the library, parsing the
tree, and extracting the root.
> doc = xmlTreeParse('counties.gml')
> root = xmlRoot(doc)
To see what's in the root, we can get a table of the names found there:
> table(names(root))
state
51
Let's extract the first state node for further study:
> onestate = root[[1]]
> class(onestate)
[1] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass"
> table(names(onestate))
county name
67 1
Here's what the onestate object looks like - I've truncated it so
that it only displays a single county, but we can still see the general
structure:
<state>
<gml:name abbreviation="AL">ALABAMA</gml:name>
<county>
<gml:name>Autauga County</gml:name>
<gml:location>
<gml:coord>
<gml:X>-86641472</gml:X>
<gml:Y>32542207</gml:Y>
</gml:coord>
</gml:location>
</county>
. . .
The name element (labeled as gml:name) is just the name of the
state. We can extract the names for all of the states using xmlSApply:
> statenames = xmlSApply(root,function(x)xmlValue(x[['name']]))
> head(statenames)
state state state state state state
"ALABAMA" "ALASKA" "ARIZONA" "ARKANSAS" "CALIFORNIA" "COLORADO"
Note that in this example there is an attribute in the name tag, namely the
state abbreviation. To access these attributes we can use the xmlAttrs
function in a fashion similar to xmlValue:
> stateabbs = xmlSApply(root,function(x)xmlAttrs(x[['name']]))
> head(stateabbs)
state.abbreviation state.abbreviation state.abbreviation state.abbreviation
"AL" "AK" "AZ" "AR"
state.abbreviation state.abbreviation
"CA" "CO"
Since there was only one attribute, xmlSApply could be used to extract the
values directly into a vector. If there were multiple attributes, then xmlApply
would need to be used, since it will return a list of attributes, preserving the
structure of the data.
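To make the distinction concrete, here is a small sketch using an inline document whose name tags carry two attributes; the second attribute, fips, is hypothetical and added purely for illustration, and asText=TRUE lets xmlTreeParse read from a string rather than a file:

```r
library(XML)
txt = '<root>
<state><name abbreviation="AL" fips="01">ALABAMA</name></state>
<state><name abbreviation="AK" fips="02">ALASKA</name></state>
</root>'
root = xmlRoot(xmlTreeParse(txt, asText = TRUE))
# xmlApply returns a list, so each state keeps its full vector of attributes
attrs = xmlApply(root, function(x)xmlAttrs(x[['name']]))
attrs[[1]]
```

Each element of attrs is a named character vector holding both attributes for one state, rather than being flattened together.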
To process the county data further, we need to extract just the county
nodes. The xmlElementsByTagName function will extract just the elements
that we want:
> counties = xmlElementsByTagName(onestate,'county')
> class(counties)
[1] "list"
> length(counties)
[1] 67
Notice that extracting the elements in this way results in a list, not an XMLNode.
But what's in the list?
> onecounty = counties[[1]]
> class(onecounty)
[1] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass"
> names(onecounty)
[1] "name" "location"
> onecounty
<county>
<gml:name>
Autauga County
</gml:name>
<gml:location>
<gml:coord>
<gml:X>
-86641472
</gml:X>
<gml:Y>
32542207
</gml:Y>
</gml:coord>
</gml:location>
</county>
The elements inside the counties list are still XMLNodes. The fact
that they are contained in a list simply means that we'll use sapply
or lapply to process them instead of xmlSApply or xmlApply.
What we really want are the X and Y values within the coord
nodes, so let's extract out those nodes from the full list of county nodes:
> coords = lapply(counties,function(x)x[['location']][['coord']])
> class(coords)
[1] "list"
> class(coords[[1]])
[1] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass"
> coords[[1]]
<gml:coord>
<gml:X>
-86641472
</gml:X>
<gml:Y>
32542207
</gml:Y>
</gml:coord>
Since there is only one coord node within the location nodes, I
extracted it directly. I could also have used xmlElementsByTagName, and
used the first element of the resulting list:
coords = lapply(counties,function(x)xmlElementsByTagName(x[['location']],'coord')[[1]])
Notice that I used lapply to extract the coord nodes. Since
XMLNode objects are represented internally as lists, I would have lost the structure
of the data if I used sapply in this case.
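The simplification that sapply performs can be seen with plain base R, independent of the XML package:

```r
# each element of this list is itself a list, much like an XMLNode
nodes = list(a = list(1, 2), b = list(3, 4))
is.list(lapply(nodes, function(x)x))      # TRUE - structure preserved
is.matrix(sapply(nodes, function(x)x))    # TRUE - collapsed into a matrix
```

Because each result has the same length, sapply silently collapses the list of lists into a matrix; lapply always returns a list, leaving the node structure intact.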
Now we can extract the x and y values using xmlValue:
> x = as.numeric(sapply(coords,function(x)xmlValue(x[['X']])))
> y = as.numeric(sapply(coords,function(x)xmlValue(x[['Y']])))
That shows the process for extracting the county names and x- and y-coordinates for
a single state. Let's summarize the steps in a function, which we can then apply to
the root of the tree to get a separate list for each state:
onestate = function(state){
    counties = xmlElementsByTagName(state,'county')
    countynames = sapply(counties,function(x)xmlValue(x[['name']]))
    coords = lapply(counties,function(x)x[['location']][['coord']])
    x = as.numeric(sapply(coords,function(x)xmlValue(x[['X']])))
    y = as.numeric(sapply(coords,function(x)xmlValue(x[['Y']])))
    data.frame(county=countynames,x=x,y=y)
}
To combine everything together, and create a list with one data frame per state,
we can do the following:
> res = xmlApply(root,onestate)
> names(res) = xmlSApply(root,function(x)xmlValue(x[['name']]))
Although these examples may seem simplistic, there are usable XML formats that are
this simple, or even simpler. For example, at the web site
http://www.weather.gov/data/current_obs/
there is a description of an XML format used to distribute weather information. Basically,
the XML document is available at http://www.weather.gov/data/current_obs/XXXX.xml
where XXXX represents the four-letter weather station code, usually related to a
nearby airport.
A quick look at the data reveals that the format is very simple:
> library(XML)
> doc = xmlTreeParse('http://www.weather.gov/data/current_obs/KOAK.xml')
> root = xmlRoot(doc)
> names(root)
[1] "credit" "credit_URL"
[3] "image" "suggested_pickup"
[5] "suggested_pickup_period" "location"
[7] "station_id" "latitude"
[9] "longitude" "observation_time"
[11] "observation_time_rfc822" "weather"
[13] "temperature_string" "temp_f"
[15] "temp_c" "relative_humidity"
[17] "wind_string" "wind_dir"
[19] "wind_degrees" "wind_mph"
[21] "wind_gust_mph" "pressure_string"
[23] "pressure_mb" "pressure_in"
[25] "dewpoint_string" "dewpoint_f"
[27] "dewpoint_c" "heat_index_string"
[29] "heat_index_f" "heat_index_c"
[31] "windchill_string" "windchill_f"
[33] "windchill_c" "visibility_mi"
[35] "icon_url_base" "icon_url_name"
[37] "two_day_history_url" "ob_url"
[39] "disclaimer_url" "copyright_url"
[41] "privacy_policy_url"
Essentially, all the information is in the root node, so it can be
extracted with xmlValue:
> xmlValue(root[['temp_f']])
[1] "72"
> xmlValue(root[['wind_mph']])
[1] "6.9"
To make this easy to use, we can write a function that will allow us to specify
a location and some variables that we want information about:
getweather = function(loc='KOAK',vars='temp_f'){
    require(XML)
    url = paste('http://www.weather.gov/data/current_obs/',loc,'.xml',sep='')
    doc = xmlTreeParse(url)
    root = xmlRoot(doc)
    sapply(vars,function(x)xmlValue(root[[x]]))
}
Let's check to make sure it works with the defaults:
> getweather()
temp_f
"76"
That seems to be ok. To make it even more useful, we can create a vector
of station names, and use sapply to find weather information for each station:
> result = sapply(c('KOAK','KACV','KSDM'),getweather,vars=c('temp_f','wind_mph','wind_dir','relative_humidity'))
> data.frame(t(result))
temp_f wind_mph wind_dir relative_humidity
KOAK 78 9.2 Northwest 35
KACV 58 17.25 North 75
KSDM 85 8.05 West 14
sapply has properly labeled all the stations and variables.