XML

1  What is XML?

XML, an acronym for Extensible Markup Language, is a way of creating documents that contain structured information. Up until now, most of the data sets that we've considered were easily accommodated by a rectangular observations-by-variables data structure (like a comma-separated file, which translates neatly to a data frame in R), where the number of variables for each observation is the same. Outside of variable names and the actual data values, there was no additional information stored in the data, either in its text form or once it was read into R.
One type of data that we've seen that accommodates extra information is the spreadsheet; people often annotate their data with all sorts of useful and maybe-not-so-useful information. But as we've seen, there's no standard structure to these non-data rows of the spreadsheet, so they really just get in our way and require customized processing.
Another example of non-regular data would be the baseball database; players played for varying numbers of seasons, and for varying numbers of teams; they may or may not have attended college, or played in a World Series, etc. Databases have traditionally been used for data like this, but they certainly have their disadvantages: there are many tables, making it necessary to perform lots of joins, and it's easy to lose track of all that information.
One frustrating aspect of designing formats for data is that if you go into the design with a specific type of data in mind, you'll build in features that are wonderful for that data, but may be very lacking for other sorts of data. For example, statisticians would want a rich syntax to describe things like repeated measures, missing values, allowable ranges of variables, etc., but these features would not be of much use to a general user. So in some ways, we can think of XML as a compromise that will allow a wide variety of different information to be stored and distributed in a similar way without favoring any particular structure or type of data. The features that I've described can easily be accommodated in an XML document by simply defining new tags to contain this information. Furthermore, XML is designed so that it's alright if there is no extra information in the document other than the data and its associated tags. This sort of flexibility is one of the most powerful forces behind the development of XML. To see a list of some of the applications for which XML document styles have been proposed, go to http://xml.coverpages.org/xml.html#applications.
At first glance, XML looks very similar to HTML, but there are important differences. First, HTML is used to describe the desired appearance of information in the browser, not the structure of that information. Second, the set of allowable tags in HTML is fixed; we can't just redefine them at our convenience. HTML is also inconsistent. Tags like <h1> and <td> must have accompanying closing tags (</h1> and </td>), but tags like <br> and <p> don't. Finally, to accommodate all the people creating web pages and all the browsers trying to display them, HTML doesn't lay down strict rules about the way information is put into a web page. These facts about HTML should not be interpreted as saying that HTML is not useful. It's very useful for what it does, but it doesn't give us a consistent, extensible format for accommodating data that preserves its structure.
Here are some of the things that we'll see in XML documents: each document has a single root element that contains all of the others; every opening tag has a matching closing tag (or uses the self-closing form <tag/>); tags must be properly nested, and tag names are case sensitive; attribute values must always be enclosed in quotes; and the content itself is stored as plain text.
Documents that conform to these rules are said to be well-formed, and this is the minimum requirement for any XML document. But to be useful, a document should also conform to additional rules described in what is known as the Document Type Definition for that kind of document, or DTD. The DTD describes the allowable tags, what kind of data can be stored within those tags, and what attributes are allowed in the tags. The DTD for a particular document can be provided in the document itself; when a document uses multiple DTDs, the tag name may be prefaced with an identifier followed by a colon to make clear which DTD the tag comes from. The programs that read XML are known as parsers; without a DTD they simply make sure that the document is well-formed, while with a DTD they can additionally check that the data is valid.

2  A Simple Example

Let's start with a very simple example to introduce the XML library and some of the strategies we'll need to use when working with XML. The example we'll use is a catalog of plants. (You can find the XML document at http://www.stat.berkeley.edu/classes/s133/data/plant_catalog.xml.) Here's what the beginning of the file looks like:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Edited with XML Spy v2006 (http://www.altova.com) -->
<CATALOG>
        <PLANT>
                <COMMON>Bloodroot</COMMON>
                <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
                <ZONE>4</ZONE>
                <LIGHT>Mostly Shady</LIGHT>
                <PRICE>$2.44</PRICE>
                <AVAILABILITY>031599</AVAILABILITY>
        </PLANT>
        <PLANT>
                <COMMON>Columbine</COMMON>
                <BOTANICAL>Aquilegia canadensis</BOTANICAL>
                <ZONE>3</ZONE>
                <LIGHT>Mostly Shady</LIGHT>
                <PRICE>$9.37</PRICE>
                <AVAILABILITY>030699</AVAILABILITY>
        </PLANT>
                          . . .

The main body of the data is enclosed by the CATALOG tag; we'll sometimes refer to this as a node. Within the CATALOG node are several PLANT nodes, each of which contains several pieces of data representing the information we wish to extract. We're going to use the R XML library with what is known as the Document Object Model (DOM), in which R reads the entire XML file into memory and internally converts it to a tree structure. (There is another model, called SAX (Simple API for XML), which only reads part of the data at a time, but it's more complicated than the DOM model.) All the information about the data and its structure will be stored in a list inside of R, but when we print it, it will look very much like the document we started with. The basic strategy for working with XML files under this model is to keep "tunneling down" into the document tree, without disturbing its structure, until we reach the terminal nodes. For example, with the plant catalog, we want to extract the PLANT nodes as XML structures, and then extract the values of the terminal branches of the tree (like COMMON, BOTANICAL, and ZONE) using the xmlValue function. As we explore the structure of the data, we can use the names function to see what nodes exist; at any time, we can print a node, and it will show us a representation like the original XML document, so we'll know where to go next.
The first step in reading an XML document into R is loading the XML library. Next, the xmlTreeParse function is called to read the document into memory, and the root of the XML tree is extracted using xmlRoot:
> library(XML)
> doc = xmlTreeParse('plant_catalog.xml')
> root = xmlRoot(doc)

Let's look at some properties of this root node:
> class(root)
[1] "XMLNode"          "RXMLAbstractNode" "XMLAbstractNode"  "oldClass" 
> names(root)
 [1] "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT"
[10] "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT"
[19] "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT"
[28] "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT" "PLANT"

In this example, xmlTreeParse read its input from a local file. Other choices are local gzipped files and URLs representing XML documents or gzipped XML documents.
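As a sketch of those other input choices (the URL is the course data address given earlier; the gzipped file name here is hypothetical):

```r
library(XML)

# Parsing directly from a URL (the course data address for this file)
doc = xmlTreeParse('http://www.stat.berkeley.edu/classes/s133/data/plant_catalog.xml')

# Parsing a local gzipped copy -- xmlTreeParse decompresses it for us;
# 'plant_catalog.xml.gz' is a hypothetical file name
# doc = xmlTreeParse('plant_catalog.xml.gz')
```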
The class of XMLNode indicates that internally, R is storing the object as part of an XML tree. This will be important later when we decide how to process the entire tree. The result of the names function call tells us that the structure of this document is quite simple. Inside the root node, there are 36 PLANT nodes. (For larger documents, it might be prudent to make a table of the names instead of displaying the names directly.) As we've already seen from looking at the document, this is where our data lies, so we'll want to examine one of the plant nodes a little more carefully to figure out how to extract the actual data.
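A sketch of that check for this document (the count of 36 comes from the names output above):

```r
# Tabulating the child names is more compact than printing them all;
# for the plant catalog this shows a single entry: PLANT appears 36 times
table(names(root))
```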
> oneplant = root[[1]]
> class(oneplant)
[1] "XMLNode"          "RXMLAbstractNode" "XMLAbstractNode"  "oldClass" 
> oneplant
 <PLANT>
  <COMMON>
  Bloodroot
  </COMMON>
  <BOTANICAL>
  Sanguinaria canadensis
  </BOTANICAL>
  <ZONE>
  4
  </ZONE>
  <LIGHT>
  Mostly Shady
  </LIGHT>
  <PRICE>
  $2.44
  </PRICE>
  <AVAILABILITY>
  031599
  </AVAILABILITY>
 </PLANT>

We can see that this single PLANT object is still an XMLNode; its printed representation shows us exactly what we've got. Notice that the individual elements don't have any further tree structure; this means we can use xmlValue to extract the values:
> xmlValue(oneplant[['COMMON']])
[1] "Bloodroot"
> xmlValue(oneplant[['BOTANICAL']])
[1] "Sanguinaria canadensis"

Of course, we don't want to perform these tasks one by one for each node of the tree. You may recall that when we were confronted with problems like this before, we used the sapply function, which operates on every element of a list. If the object we want to process is an xmlNode, the corresponding function to use is xmlSApply. For example, to get all the common names of all the plants, we can use xmlValue on all the PLANT nodes like this:
> commons = xmlSApply(root,function(x)xmlValue(x[['COMMON']]))
> head(commons)
                PLANT                 PLANT                 PLANT
          "Bloodroot"           "Columbine"      "Marsh Marigold"
                PLANT                 PLANT                 PLANT
            "Cowslip" "Dutchman's-Breeches"        "Ginger, Wild"

We could repeat the process for each column manually, and then combine things into a data frame, or we can automate the process using lapply and the names of the objects within the plants:
> getvar = function(x,var)xmlValue(x[[var]])
> res = lapply(names(root[[1]]),function(var)xmlSApply(root,getvar,var))
> plants = data.frame(res)
> names(plants) = names(root[[1]])
> head(plants)
               COMMON              BOTANICAL ZONE        LIGHT PRICE
1           Bloodroot Sanguinaria canadensis    4 Mostly Shady $2.44
2           Columbine   Aquilegia canadensis    3 Mostly Shady $9.37
3      Marsh Marigold       Caltha palustris    4 Mostly Sunny $6.81
4             Cowslip       Caltha palustris    4 Mostly Shady $9.90
5 Dutchman's-Breeches    Dicentra cucullaria    3 Mostly Shady $6.44
6        Ginger, Wild       Asarum canadense    3 Mostly Shady $9.03
  AVAILABILITY
1       031599
2       030699
3       051799
4       030699
5       012099
6       041899
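Note that every column of plants holds text at this point, since xmlValue always returns character strings. Here's a possible cleanup sketch; the reading of AVAILABILITY as a date in MMDDYY form is my assumption, based only on how the values look:

```r
# PRICE has a leading dollar sign; strip it and convert to numeric
plants$PRICE = as.numeric(sub('^\\$', '', as.character(plants$PRICE)))

# AVAILABILITY looks like a date in MMDDYY form ("031599" = March 15, 1999);
# this interpretation is an assumption, not something documented in the file
plants$AVAILABILITY = as.Date(as.character(plants$AVAILABILITY), format = '%m%d%y')
```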

3  A More Complex Example

For more complex documents, a few other tools are useful. To illustrate, we'll look at a file that uses Geographic Markup Language, or GML. This file (which you can find at http://www.stat.berkeley.edu/classes/s133/data/counties.gml) contains the x- and y-coordinates of the county centers for each state in the United States. Information like this would be difficult to store in a less structured environment, because each state has a different number of counties. If we were going to read it into a database, we might want to have a separate table for each state; in some programs, we'd have to force it into a rectangular form, even though it wouldn't be that efficient. If we were using a spreadsheet, we might want to put all the information in a single spreadsheet, with a separate heading for each state. In R, a reasonable way to store the data would be in a list, with a separate data frame for each state. Thus, providing the data in a form that could be easily converted to any of those formats is essential, and that's just what XML does for us.
The first steps are the same as the previous example; loading the library, parsing the tree, and extracting the root.
> doc = xmlTreeParse('counties.gml')
> root = xmlRoot(doc)

To see what's in the root, we can get a table of the names found there:
> table(names(root))

state
   51

Let's extract the first state node for further study:
> onestate = root[[1]]
> class(onestate)
[1] "XMLNode"          "RXMLAbstractNode" "XMLAbstractNode"  "oldClass"
> table(names(onestate))

county   name
    67      1

Here's what the onestate object looks like - I've truncated it so that it only displays a single county, but we can still see the general structure:
<state>
 <gml:name abbreviation="AL">ALABAMA</gml:name>
 <county>
  <gml:name>Autauga County</gml:name>
  <gml:location>
   <gml:coord>
    <gml:X>-86641472</gml:X>
    <gml:Y>32542207</gml:Y>
   </gml:coord>
  </gml:location>
 </county>
          . . .

The name element (labeled as gml:name) is just the name of the state. We can extract these names from all of the states using xmlSApply:
> statenames = xmlSApply(root,function(x)xmlValue(x[['name']]))
> head(statenames)
       state        state        state        state        state        state
   "ALABAMA"     "ALASKA"    "ARIZONA"   "ARKANSAS" "CALIFORNIA"   "COLORADO"

Note that in this example there is an attribute in the name tag, namely the state abbreviation. To access these attributes we can use the xmlAttrs function in a fashion similar to xmlValue:
> stateabbs = xmlSApply(root,function(x)xmlAttrs(x[['name']]))
> head(stateabbs)
state.abbreviation state.abbreviation state.abbreviation state.abbreviation
              "AL"               "AK"               "AZ"               "AR"
state.abbreviation state.abbreviation
              "CA"               "CO"

Since there was only one attribute, xmlSApply was used to extract directly into a vector. If there were multiple attributes, then xmlApply would need to be used, since it will return a list of attributes, preserving the structure of the data.
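As a sketch of that multi-attribute pattern (here the name tag happens to carry just one attribute, so each list element is a named vector of length one):

```r
# xmlApply returns a list with one element per state; each element is the
# full named vector of attributes found on that state's name node
attrlist = xmlApply(root, function(x) xmlAttrs(x[['name']]))
attrlist[[1]]   # the first state's attributes, e.g. abbreviation = "AL"
```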
To process the county data further, we need to extract just the county nodes. The xmlElementsByTagName function will extract just the elements that we want:
> counties = xmlElementsByTagName(onestate,'county')
> class(counties)
[1] "list"
> length(counties)
[1] 67

Notice that extracting the elements in this way results in a list, not an xmlNode. But what's in the list?
> onecounty = counties[[1]]
> class(onecounty)
[1] "XMLNode"          "RXMLAbstractNode" "XMLAbstractNode"  "oldClass" 
> names(onecounty)
[1] "name"     "location"
> onecounty
 <county>
  <gml:name>
  Autauga County
  </gml:name>
  <gml:location>
   <gml:coord>
    <gml:X>
    -86641472
    </gml:X>
    <gml:Y>
    32542207
    </gml:Y>
   </gml:coord>
  </gml:location>
 </county>

The elements inside the counties list are still xmlNodes. The fact that they are contained in a list simply means that we'll use sapply or lapply to process them instead of xmlSApply or xmlApply. What we really want are the X and Y values within the coord nodes, so let's extract out those nodes from the full list of county nodes:
> coords = lapply(counties,function(x)x[['location']][['coord']])
> class(coords)
[1] "list"
> class(coords[[1]])
[1] "XMLNode"          "RXMLAbstractNode" "XMLAbstractNode"  "oldClass"
> coords[[1]]
 <gml:coord>
  <gml:X>
  -86641472
  </gml:X>
  <gml:Y>
  32542207
  </gml:Y>
 </gml:coord>

Since there is only one coord node within the location nodes, I extracted it directly. I could also have used xmlElementsByTagName, and used the first element of the resulting list:
coords = lapply(counties,function(x)xmlElementsByTagName(x[['location']],'coord')[[1]])

Notice that I used lapply to extract the coord nodes. Since xmlNodes are represented internally as lists, I would have lost the structure of the data if I had used sapply in this case.
Now we can extract the x and y values using xmlValue:
> x = as.numeric(sapply(coords,function(x)xmlValue(x[['X']])))
> y = as.numeric(sapply(coords,function(x)xmlValue(x[['Y']])))

That shows the process for extracting the county names and x- and y-coordinates for a single state. Let's summarize the steps in a function, which we can then apply to the root of the tree to get a separate list for each state:
onestate = function(state){
    counties = xmlElementsByTagName(state,'county')
    countynames = sapply(counties,function(x)xmlValue(x[['name']]))
    coords = lapply(counties,function(x)x[['location']][['coord']])
    x = as.numeric(sapply(coords,function(x)xmlValue(x[['X']])))
    y = as.numeric(sapply(coords,function(x)xmlValue(x[['Y']])))
    data.frame(county=countynames,x=x,y=y)
}

To combine everything together, and create a list with one data frame per state, we can do the following:
> res = xmlApply(root,onestate)
> names(res) = xmlSApply(root,function(x)xmlValue(x[['name']]))
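The coordinate values in this file appear to be stored as millionths of a degree (for example, -86641472 for Autauga County corresponds to roughly -86.64 degrees longitude); that scaling is an assumption on my part, based only on the magnitude of the values. Under that assumption, the list can be rescaled after the fact:

```r
# Convert each state's coordinates from millionths of a degree to degrees
# (the 1e6 scale factor is an assumption based on the size of the values)
res = lapply(res, function(d) transform(d, x = x/1e6, y = y/1e6))
head(res[['ALABAMA']], 3)
```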

Although these examples may seem simplistic, there are usable XML formats that are this simple, or even simpler. For example, at the web site http://www.weather.gov/data/current_obs/ there is a description of an XML format used to distribute weather information. Basically, the XML document is available at http://www.weather.gov/data/current_obs/XXXX.xml where XXXX represents the four-letter weather station code, usually related to a nearby airport.
A quick look at the data reveals that the format is very simple:
> library(XML)
> doc = xmlTreeParse('http://www.weather.gov/data/current_obs/KOAK.xml')
> root = xmlRoot(doc)
> names(root)
 [1] "credit"                  "credit_URL"
 [3] "image"                   "suggested_pickup"
 [5] "suggested_pickup_period" "location"
 [7] "station_id"              "latitude"
 [9] "longitude"               "observation_time"
[11] "observation_time_rfc822" "weather"
[13] "temperature_string"      "temp_f"
[15] "temp_c"                  "relative_humidity"
[17] "wind_string"             "wind_dir"
[19] "wind_degrees"            "wind_mph"
[21] "wind_gust_mph"           "pressure_string"
[23] "pressure_mb"             "pressure_in"
[25] "dewpoint_string"         "dewpoint_f"
[27] "dewpoint_c"              "heat_index_string"
[29] "heat_index_f"            "heat_index_c"
[31] "windchill_string"        "windchill_f"
[33] "windchill_c"             "visibility_mi"
[35] "icon_url_base"           "icon_url_name"
[37] "two_day_history_url"     "ob_url"
[39] "disclaimer_url"          "copyright_url"
[41] "privacy_policy_url"

Essentially, all the information is in the root node, so it can be extracted with xmlValue:
> xmlValue(root[['temp_f']])
[1] "49.0"
> xmlValue(root[['wind_mph']])
[1] "12.7"

To make this easy to use, we can write a function that will allow us to specify a location and some variables that we want information about:
getweather = function(loc='KOAK',vars='temp_f'){
   require(XML)
   url = paste('http://www.weather.gov/data/current_obs/',loc,'.xml',sep='')
   doc = xmlTreeParse(url)
   root = xmlRoot(doc)
   sapply(vars,function(x)xmlValue(root[[x]]))
}

Let's check to make sure it works with the defaults:
> getweather()
temp_f 
"49.0" 

That seems to be ok. To make it even more useful, we can create a vector of station names, and use sapply to find weather information for each station:
> result = sapply(c('KOAK','KACV','KSDM'),getweather,vars=c('temp_f','wind_mph','wind_dir','relative_humidity'))
> data.frame(t(result))
     temp_f wind_mph  wind_dir relative_humidity
KOAK   49.0     12.7     South                93
KACV   45.0      6.9     South                86
KSDM   66.0      6.9 Northwest                42

sapply has properly labeled all the stations and variables.
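One more detail worth remembering: since xmlValue returns character strings, the numeric-looking columns in that data frame are really text. A possible conversion sketch:

```r
# Rebuild the data frame without factors, then convert the numeric columns;
# 'result' is the matrix returned by the sapply call above
weather = data.frame(t(result), stringsAsFactors = FALSE)
for (v in c('temp_f', 'wind_mph', 'relative_humidity'))
    weather[[v]] = as.numeric(weather[[v]])
```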
If information was recorded at different times, the observation_time_rfc822 variable could be converted to an R POSIXct object as follows:
> as.POSIXct(getweather('KOAK','observation_time_rfc822'),format='%a, %d %b %Y %H:%M:%S %z')
[1] "2011-03-18 14:53:00 PDT"



