XML
1 A Simple Example
Let's start with a very simple example to introduce the XML library and some of
the strategies we'll need to use when working with XML. The example we'll use is
a catalog of plants. (You can find the XML document at
http://www.stat.berkeley.edu/~spector/s133/data/plant_catalog.xml.)
Here's what the beginning of the file looks like:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Edited with XML Spy v2006 (http://www.altova.com) -->
<CATALOG>
<PLANT>
<COMMON>Bloodroot</COMMON>
<BOTANICAL>Sanguinaria canadensis</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$2.44</PRICE>
<AVAILABILITY>031599</AVAILABILITY>
</PLANT>
<PLANT>
<COMMON>Columbine</COMMON>
<BOTANICAL>Aquilegia canadensis</BOTANICAL>
<ZONE>3</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$9.37</PRICE>
<AVAILABILITY>030699</AVAILABILITY>
</PLANT>
. . .
The main body of the data is enclosed by the CATALOG tag; we'll sometimes
refer to this as a node.
Within the CATALOG node are a number of PLANT nodes, each of which
contains the pieces of data we wish to extract.
We're going to use the R XML library with what is known as the Document Object Model
(DOM), in which R will read the entire XML file into memory, and internally convert it to a
tree structure. (There is another model, called SAX (Simple API for XML), which only
reads part of the data at a time, but it's more complicated than the DOM model.)
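To give a flavor of the SAX approach, here is a minimal sketch using the XML package's xmlEventParse function; the handler and counter logic are our own, not part of this example, and simply count PLANT nodes as the parser streams through the file:

```r
library(XML)

# SAX-style parsing: handler functions are called as the parser moves
# through the document, so the full tree is never held in memory
counter = 0
startElement = function(name, attrs, ...){
    if (name == 'PLANT') counter <<- counter + 1
}
invisible(xmlEventParse('plant_catalog.xml',
                        handlers = list(startElement = startElement)))
counter    # the number of PLANT nodes encountered
```

For a document this small the DOM approach is simpler; SAX becomes attractive when the file is too large to hold in memory at once.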
All the information about the data and its structure will be stored
in a list inside of R, but when we print it, it will look very much like the document
that we started with. The basic strategy of working with XML files under this model
is to keep "tunneling down" into the document structure without disturbing its
structure until we come to the end of the node. For example, with the plant catalog,
we want to extract the PLANT nodes as XML structures, and then extract the
values of the terminal branches of the tree (like COMMON, BOTANICAL,
and ZONE) using the xmlValue function. As we explore the structure
of the data, we can use the names function to see what nodes exist; at any
time, we can print a node, and it will show us a representation like the original XML
document so we'll know where to go next.
The first step in reading an XML document into R is loading the XML library.
Next, the xmlTreeParse function is called to read the document into memory,
and the root of the XML tree is extracted using xmlRoot:
> library(XML)
> doc = xmlTreeParse('plant_catalog.xml')
> root = xmlRoot(doc)
Let's look at some properties of this root node:
> class(root)
[1] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass"
> names(root)
[1] "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT"
[10] "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT"
[19] "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT"
[28] "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT"
In this example, xmlTreeParse read its input from a local file.
Other choices are local gzipped files and URLs representing XML
documents or gzipped XML documents.
The class of XMLNode indicates that internally, R is storing the object
as part of an XML tree. This will be important later when we decide how to
process the entire tree. The result of the names function call tells us
that the structure of this document is quite simple. Inside the root node, there
are 36 PLANT nodes. (For larger documents, it might be prudent to make
a table of the names instead of displaying the names directly.) As we've already seen
from looking at the document,
this is where our data lies, so we'll want to examine one of the plant nodes a
little more carefully to figure out how to extract the actual data.
> oneplant = root[[1]]
> class(oneplant)
[1] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass"
> oneplant
<PLANT>
<COMMON>
Bloodroot
</COMMON>
<BOTANICAL>
Sanguinaria canadensis
</BOTANICAL>
<ZONE>
4
</ZONE>
<LIGHT>
Mostly Shady
</LIGHT>
<PRICE>
$2.44
</PRICE>
<AVAILABILITY>
031599
</AVAILABILITY>
</PLANT>
We can see that this single PLANT object is still an XMLNode; its
printed representation shows us exactly what we've got. Notice that the individual
elements don't have any further tree structure; this means we can use xmlValue
to extract the values:
> xmlValue(oneplant[['COMMON']])
[1] "Bloodroot"
> xmlValue(oneplant[['BOTANICAL']])
[1] "Sanguinaria canadensis"
Of course, we don't want to perform these tasks one by one for each node of the
tree. You may recall that when we were confronted with problems like this before,
we used the sapply function, which operates on every element of a list.
If the object we want to process is an XMLNode, the corresponding function to
use is xmlSApply. For example, to get all the common names of all the plants,
we can use xmlValue on all the PLANT nodes like this:
> commons = xmlSApply(root,function(x)xmlValue(x[['COMMON']]))
> head(commons)
PLANT PLANT PLANT
"Bloodroot" "Columbine" "Marsh Marigold"
PLANT PLANT PLANT
"Cowslip" "Dutchman's-Breeches" "Ginger, Wild"
We could repeat the process for each column manually, and then combine things into
a data frame, or we can automate the process using lapply and the names
of the objects within the plants:
> getvar = function(x,var)xmlValue(x[[var]])
> res = lapply(names(root[[1]]),function(var)xmlSApply(root,getvar,var))
> plants = data.frame(res)
> names(plants) = names(root[[1]])
> head(plants)
COMMON BOTANICAL ZONE LIGHT PRICE
1 Bloodroot Sanguinaria canadensis 4 Mostly Shady $2.44
2 Columbine Aquilegia canadensis 3 Mostly Shady $9.37
3 Marsh Marigold Caltha palustris 4 Mostly Sunny $6.81
4 Cowslip Caltha palustris 4 Mostly Shady $9.90
5 Dutchman's-Breeches Dicentra cucullaria 3 Mostly Shady $6.44
6 Ginger, Wild Asarum canadense 3 Mostly Shady $9.03
AVAILABILITY
1 031599
2 030699
3 051799
4 030699
5 012099
6 041899
2 A More Complex Example
For more complex documents, a few other tools are useful. To illustrate, we'll look
at a file that uses Geographic Markup Language, or GML.
This file (which you can find at
http://www.stat.berkeley.edu/~spector/s133/data/counties.gml)
contains the x- and y-coordinates of the county centers for each state in the
United States. Information like this would be difficult to store in a less structured
environment, because each state has a different number of counties. If we were going
to read it into a database, we might want a separate table for each state; in
other programs, we'd have to force the data into a rectangular form, which
wouldn't be very efficient. If we were using a spreadsheet, we might put all the information
in a single spreadsheet, with a separate heading for each state.
In R, a reasonable way to store the data would be in a list, with a separate data frame
for each state.
Thus, providing the
data in a form that could be easily converted to any of those formats is essential, and
that's just what XML does for us.
The first steps are the same as in the previous example: loading the library, parsing the
tree, and extracting the root.
> doc = xmlTreeParse('counties.gml')
> root = xmlRoot(doc)
To see what's in the root, we can get a table of the names found there:
> table(names(root))
state
51
Let's extract the first state node for further study:
> onestate = root[[1]]
> class(onestate)
[1] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass"
> table(names(onestate))
county name
67 1
Here's what the onestate object looks like - I've truncated it so
that it only displays a single county, but we can still see the general
structure:
<state>
<gml:name abbreviation="AL">ALABAMA</gml:name>
<county>
<gml:name>Autauga County</gml:name>
<gml:location>
<gml:coord>
<gml:X>-86641472</gml:X>
<gml:Y>32542207</gml:Y>
</gml:coord>
</gml:location>
</county>
. . .
The name element (labeled as gml:name) is just the name of the
state. We can extract the names for all of the states using xmlSApply:
> statenames = xmlSApply(root,function(x)xmlValue(x[['name']]))
> head(statenames)
state state state state state state
"ALABAMA" "ALASKA" "ARIZONA" "ARKANSAS" "CALIFORNIA" "COLORADO"
Note that in this example there is an attribute in the name tag, namely the
state abbreviation. To access these attributes we can use the xmlAttrs
function in a fashion similar to xmlValue:
> stateabbs = xmlSApply(root,function(x)xmlAttrs(x[['name']]))
> head(stateabbs)
state.abbreviation state.abbreviation state.abbreviation state.abbreviation
"AL" "AK" "AZ" "AR"
state.abbreviation state.abbreviation
"CA" "CO"
Since there was only one attribute, xmlSApply could be used to extract the
values directly into a vector. If there were multiple attributes, then xmlApply
would need to be used, since it will return a list of attributes, preserving the
structure of the data.
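To make the distinction concrete, here is a small sketch using an inline document whose name tags carry two attributes; the second attribute, fips, is hypothetical and added purely for illustration, and asText=TRUE lets xmlTreeParse read from a string rather than a file:

```r
library(XML)
txt = '<root>
<state><name abbreviation="AL" fips="01">ALABAMA</name></state>
<state><name abbreviation="AK" fips="02">ALASKA</name></state>
</root>'
root = xmlRoot(xmlTreeParse(txt, asText = TRUE))
# xmlApply returns a list, so each state keeps its full vector of attributes
attrs = xmlApply(root, function(x)xmlAttrs(x[['name']]))
attrs[[1]]
```

Each element of attrs is a named character vector holding both attributes for one state, rather than being flattened together.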
To process the county data further, we need to extract just the county
nodes. The xmlElementsByTagName function will extract just the elements
that we want:
> counties = xmlElementsByTagName(onestate,'county')
> class(counties)
[1] "list"
> length(counties)
[1] 67
Notice that extracting the elements in this way results in a list, not an XMLNode.
But what's in the list?
> onecounty = counties[[1]]
> class(onecounty)
[1] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass"
> names(onecounty)
[1] "name" "location"
> onecounty
<county>
<gml:name>
Autauga County
</gml:name>
<gml:location>
<gml:coord>
<gml:X>
-86641472
</gml:X>
<gml:Y>
32542207
</gml:Y>
</gml:coord>
</gml:location>
</county>
The elements inside the counties list are still XMLNodes. The fact
that they are contained in a list simply means that we'll use sapply
or lapply to process them instead of xmlSApply or xmlApply.
What we really want are the X and Y values within the coord
nodes, so let's extract out those nodes from the full list of county nodes:
> coords = lapply(counties,function(x)x[['location']][['coord']])
> class(coords)
[1] "list"
> class(coords[[1]])
[1] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass"
> coords[[1]]
<gml:coord>
<gml:X>
-86641472
</gml:X>
<gml:Y>
32542207
</gml:Y>
</gml:coord>
Since there is only one coord node within the location nodes, I
extracted it directly. I could also have used xmlElementsByTagName, and
used the first element of the resulting list:
coords = lapply(counties,function(x)xmlElementsByTagName(x[['location']],'coord')[[1]])
Notice that I used lapply to extract the coord nodes. Since
XMLNode objects are represented internally as lists, I would have lost the structure
of the data if I used sapply in this case.
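The simplification that sapply performs can be seen with plain base R, independent of the XML package:

```r
# each element of this list is itself a list, much like an XMLNode
nodes = list(a = list(1, 2), b = list(3, 4))
is.list(lapply(nodes, function(x)x))      # TRUE - structure preserved
is.matrix(sapply(nodes, function(x)x))    # TRUE - collapsed into a matrix
```

Because each result has the same length, sapply silently collapses the list of lists into a matrix; lapply always returns a list, leaving the node structure intact.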
Now we can extract the x and y values using xmlValue:
> x = as.numeric(sapply(coords,function(x)xmlValue(x[['X']])))
> y = as.numeric(sapply(coords,function(x)xmlValue(x[['Y']])))
That shows the process for extracting the county names and x- and y-coordinates for
a single state. Let's summarize the steps in a function, which we can then apply to
the root of the tree to get a separate list for each state:
onestate = function(state){
    counties = xmlElementsByTagName(state,'county')
    countynames = sapply(counties,function(x)xmlValue(x[['name']]))
    coords = lapply(counties,function(x)x[['location']][['coord']])
    x = as.numeric(sapply(coords,function(x)xmlValue(x[['X']])))
    y = as.numeric(sapply(coords,function(x)xmlValue(x[['Y']])))
    data.frame(county=countynames,x=x,y=y)
}
To combine everything together, and create a list with one data frame per state,
we can do the following:
> res = xmlApply(root,onestate)
> names(res) = xmlSApply(root,function(x)xmlValue(x[['name']]))
Although these examples may seem simplistic, there are usable XML formats that are
this simple, or even simpler. For example, at the web site
http://www.weather.gov/data/current_obs/
there is a description of an XML format used to distribute weather information. Basically,
the XML document is available at http://www.weather.gov/data/current_obs/XXXX.xml
where XXXX represents the four-letter weather station code, usually related to a
nearby airport.
A quick look at the data reveals that the format is very simple:
> library(XML)
> doc = xmlTreeParse('http://www.weather.gov/data/current_obs/KOAK.xml')
> root = xmlRoot(doc)
> names(root)
[1] "credit" "credit_URL"
[3] "image" "suggested_pickup"
[5] "suggested_pickup_period" "location"
[7] "station_id" "latitude"
[9] "longitude" "observation_time"
[11] "observation_time_rfc822" "weather"
[13] "temperature_string" "temp_f"
[15] "temp_c" "relative_humidity"
[17] "wind_string" "wind_dir"
[19] "wind_degrees" "wind_mph"
[21] "wind_gust_mph" "pressure_string"
[23] "pressure_mb" "pressure_in"
[25] "dewpoint_string" "dewpoint_f"
[27] "dewpoint_c" "heat_index_string"
[29] "heat_index_f" "heat_index_c"
[31] "windchill_string" "windchill_f"
[33] "windchill_c" "visibility_mi"
[35] "icon_url_base" "icon_url_name"
[37] "two_day_history_url" "ob_url"
[39] "disclaimer_url" "copyright_url"
[41] "privacy_policy_url"
Essentially, all the information is in the root node, so it can be
extracted with xmlValue:
> xmlValue(root[['temp_f']])
[1] "72"
> xmlValue(root[['wind_mph']])
[1] "6.9"
To make this easy to use, we can write a function that will allow us to specify
a location and some variables that we want information about:
getweather = function(loc='KOAK',vars='temp_f'){
    require(XML)
    url = paste('http://www.weather.gov/data/current_obs/',loc,'.xml',sep='')
    doc = xmlTreeParse(url)
    root = xmlRoot(doc)
    sapply(vars,function(x)xmlValue(root[[x]]))
}
Let's check to make sure it works with the defaults:
> getweather()
temp_f
"76"
That seems to be ok. To make it even more useful, we can create a vector
of station names, and use sapply to find weather information for each station:
> result = sapply(c('KOAK','KACV','KSDM'),getweather,vars=c('temp_f','wind_mph','wind_dir','relative_humidity'))
> data.frame(t(result))
temp_f wind_mph wind_dir relative_humidity
KOAK 78 9.2 Northwest 35
KACV 58 17.25 North 75
KSDM 85 8.05 West 14
sapply has properly labeled all the stations and variables.