5 Reading and Writing Data in Tables

Data often are provided in a simple table-like format in a text file where rows correspond to observations and columns to variables. In this chapter, we examine some typical text file formats for tables, and address how to read a table into R as a data frame. We use several data sets as examples throughout the chapter. These are described below. Each description includes a summary of what constitutes the rows and columns in the data table.

Not all data sets have a simple table-like structure, and in Chapter ?? we tackle the challenge of working with data in more complex formats. Additionally, not all data are available in text files. By text file we mean a file that contains ASCII characters of coding for some human language.

5.1 Examples of Data Tables

Example: The Kaiser Family Data

The Child Health and Development Studies (CHDS) is a comprehensive investigation of all pregnancies that occurred between 1960 and 1967 among women who received prenatal care in the Kaiser Foundation Health Plan in the San Francisco–East Bay area and delivered at any one of the Kaiser hospitals in northern California. Over 15,000 families participated in the CHDS. The babies and their parents were followed through adolescence. The study had several goals, one of which was to examine the effect of the mother smoking during pregnancy on the baby.

  • row: A baby born at the Kaiser facility

  • columns: baby and parent information (e.g., birth weight and length, mother’s age, weight and height) .

Example: Traffic on California Freeways

The freeway Performance Measurement System (PeMS) has a Web site that allows registered users to get information about traffic on California’s freeways. The Web site provides access to data from loop detectors, which are recording devices embedded in the road at various locations in California’s freeway system. We have an extract from PeMS that contains recordings, for each of 3 lanes, of the flow (number of cars) and occupancy (the percentage of time cars were covering the loop) in successive 5 minute intervals over several days at one location.

  • row: Five minute interval in a day.

  • columns: Time of day, and flow and occupancy at one location for each of 3 lanes (in the same direction on the freeway).

Example: Drug Abuse Warning Network Survey

The US federal government conducts several large complex surveys to inform Congress and government agencies on important issues facing the US public and businesses. One such survey is the annual Drug Abuse Warning Network Survey (DAWN), which studies substance-related emergency room visits. Each record in DAWN corresponds to one emergency-room visit for substance-related reasons. For each visit, information is recorded about the patient, including the type of visit (e.g., suicide attempt, adverse reaction, accidental ingestion) and the substances found in the patient.

  • row: Emergency-room visit for a substance-related reason.

  • columns: Patient information (e.g., age, sex, race) and substances present in patient.

Example: World Bank Reports

The World Bank provides financial and technical assistance to developing countries. In 2010, the World Bank launched an Open Data Website that provides access to data from their reports on topics such as GDP, education, health, and the environment.

  • row: A country reported in the World Bank

  • columns: Information for a particular year about the country, such as GDP, birth rate, educational attainment for females in various age groups, and employment to population ratios.

Example: Handwritten Digits

MNIST (Modified National Institute of Standards and Technology) is a collection of images of handwritten digits for developing algorithms to classify digits. The image for each digit is a \(28 \times 28\) grid of pixels. The value for a pixel is an integer between 0 and 255, inclusive, which indicates the lightness or darkness of the pixels with 0 being white or empty and 255 being black. These images have been “hand”-classified so that we know the true number that was written.

  • row: An image of a handwritten digit

  • columns: The values for the 784 pixels that compose the image.

5.2 Text Formats

Our first step in accessing tabular data for exploration and analysis is to determine how the information is organized in the source file. For example, we would want to know whether the data values are separated from each other by commas. In this chapter, we consider 3 standard text formats: delimited, fixed-width, and key-value. We describe each format and give an example of how to read data in that forms. These examples are taken from the datasets introduced in Section 5.1.

Statisticians often work with plain text source files. We can typically determine whether or not the file is plain text from its filename extension. Extensions such as csv, txt, fwf, and dcf correspond to plain text files. These are shorthand for comma-separated values, text, fixed-width format, and debian control format, respectively. Note that filename extensions are only a convention, and they do not guarantee that the contents of the file is plain text.
We can confirm a file is plain text by viewing the contents in a plain text editor, such as Emacs, NotePad++, Sublime, TextWrangler, and Vim. Also, we can confirm the format of the file this way, e.g., we can see whether the values are separated by commas or not. Additionally, the file may be accompanied with documentation that indicates whether or not the contents are plain text and how the information is organized.

Filename extensions, such as xls, rda, and sas, correspond to Excel, R, and SAS binary files, respectively. If we open one of these files with a plain text editor, the contents appear as gibberish. With these files, we typically use the file type’s associated software to read and analyze the data. Data in these formats can sometimes be exchanged between these software programs. For example, the R packages haven and R.matlab provide functions to read SPSS and MATLAB data files into R, respectively.

Data can also be exchanged via binary formats that do not directly map to plain text and that are not specially formatted for particular software tools. In order to read these files, we need precise knowledge of the file format.

5.3 Delimited Data

A typical arrangement of tabular data has one line in the source file correspond to one row in the conceptual data table, which becomes one record in an R data frame. For each line in the file, we need to be able to distinguish the values for the various variables. Delimited data use a delimiter, such as a comma, to separate these values. In addition to the comma, other typical delimiters include the tab, semicolon, and any sort of white space. With these input files, each row is simply divided into pieces at the locations of the delimiters. The file displayed below is an example of comma-delimited data. These data are for the simple family data frame introduced in the Chapter 3. Each row corresponds to a family member, and the values for the person’s first name, sex, age, height, etc. are separated by commas, i.e.,

firstName,sex,age,height,weight,bmi,overWt
Tom,m,77,70,175,25.16239,TRUE   
Maya,f,33,64,124,21.50106,FALSE    
Joe,m,79,73,185,24.45884,FALSE 
Robert,m,47,67,156,24.48414,FALSE    
Sue,f,27,61,98,18.51492,FALSE    
Liz,f,33,68,190,28.94981,TRUE    
...

Notice how the lines in the file have different lengths, depending on the number of characters in the person’s name, the number of digits in weight, and whether the value for the over weight indicator is TRUE or FALSE. R uses the commas in a line to split the text into values for the variables. That is, the value before the 1st comma is assigned to the 1st variable (firstName), the value between the 1st and 2nd commas is the value for the 2nd variable (sex), and so on. Notice the values for sex and weight do not line up from one record to the next, but the value for sex appears after the 1st comma and weight appears after the 4th comma in each row. If, for example, height is missing for Tom, then this can be conveyed by the absence of a value between the 3rd and 4th commas, e.g.,

Tom,m,77,,175,25.16239,TRUE 

Alternatively, there may be a convention adopted for missing values. For example, -99 might indicate that height is missing for that individual, e.g.,

Tom,m,77,-99,175,25.16239,TRUE 

If these kinds of values are used to denote missingness, then we would expect a codebook or some information to be available to tell us about the special values and what they stand for.

Of course, a data value in a comma-delimited file cannot contain a comma, unless it appears within quotations.

5.3.1 White Space Delimited Kaiser Health Data

The file babies.txt contains records from the Child Health and Development Studies (CHDS) described in Section 5.1. The babies.txt file contains a subset of this information collected for 1236 babies. Below is a snippet of the source file:

  id  pluralty outcome date gestation sex bwt parity race age ed ht wt...
  15    5    1 1411  284    1  120    1    8   27    5   62  100...
  20    5    1 1499  282    1  113    2    0   33    5   64  135...
  58    5    1 1576  279    1  128    1    0   28    2   64  115...
  61    5    1 1504  999    1  123    2    0   36    5   69  190...
  72    5    1 1425  282    1  108    1    0   23    5   67  125...
 100    5    1 1673  286    1  136    4    0   25    2   62   93...

The filename extension of txt indicates this is a plain text file. When we examine the contents of the file, we see that the rows have white space between the data values. These could be one or more blanks or tabs. We also notice that the first row of the file is a “header” row that contains variable names.

We don’t need to further investigate the file to determine the exact nature of the separators. We use the read.table() function to read the file with

infants = read.table("data/babies.txt", header = TRUE)

Notice that we do not provide a delimiter value to read.table because the default value is any kind of white space. The header = TRUE argument specifies that the first row of the file contains variable names.

The return value from read.table() is a data frame (see Chapter 4). Let’s examine summary statistics for a few of the variables

summary(infants$wt)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
##  87.000000 115.000000 126.000000 153.976537 140.000000 999.000000
summary(infants$pluralty)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       5       5       5       5       5       5

Although we have successfully read the data into a data frame in R, we need to know more about what the values represent to be able to analyze the data. That is, why is the maximum weight 999 and why are all of the values for pluralty 5? (See Section 6.1 for more information about the coding of the data.)

5.3.2 Comma-Delimited Traffic Data

The file flow-occ.txt contains recordings from the loop detector system described earlier. That is, each row in the file corresponds to a recording taken at 3 loop detectors. These measurements are the flow (number of cars) and the occupancy (the percentage of time cars were covering the loop) in a 5-minute interval for each lane. The data contain readings for succesive 5-minute intervals over several days. Below is a snippet of the source file:

'Timestamp','Lane 1 Occ','Lane 1 Flow','Lane 2 Occ',\
 'Lane 2 Flow','Lane 3 Occ','Lane 3 Flow'
3/14/2003 00:00:00,.01,14,.0186,27,.0137,17
3/14/2003 00:05:00,.0133,18,.025,39,.0187,25
3/14/2003 00:10:00,.0088,12,.018,30,.0095,11
3/14/2003 00:15:00,.0115,16,.0203,33,.0217,19
3/14/2003 00:20:00,.0069,8,.0178,25,.0123,13
3/14/2003 00:25:00,.0077,11,.0151,24,.0092,13

The filename extension of txt indicates this is a plain text file. When we examine the contents of the file, we can see that these data are comma-delimited. Notice that the first row of the file is a “header” row that contains variable names. The labels 'Lane 1 Occ' and 'Lane 1 Flow' refer to the occupancy and flow measurements for the leftmost lane on the freeway. Similarly, lane 2 is the center lane, and lane 3 is the farthest right lane. The header line is displayed here across 2 rows in order for it to fit within the page margins (The \ indicates that the line in the file is continued on the following line in the display.)

We have the information required to determine how to read the contents of the text file into a data frame. Specifically, the values are comma separated, and the first line is a header. We use read_delim() in the readr package to read the file with

library(readr)
traffic = read_delim("data/flow-occ.txt", delim = ",")
## Parsed with column specification:
## cols(
##   `'Timestamp'` = col_character(),
##   `'Lane 1 Occ'` = col_double(),
##   `'Lane 1 Flow'` = col_double(),
##   `'Lane 2 Occ'` = col_double(),
##   `'Lane 2 Flow'` = col_double(),
##   `'Lane 3 Occ'` = col_double(),
##   `'Lane 3 Flow'` = col_double()
## )

Notice that we provided 2 arguments to read_delim() – the file name and the delimiter, both as strings. The return value from read_delim() is a special subclass of a data frame called a tibble.

Let’s examine the first few rows of this data frame to confirm that the data are read into R as expected

head(traffic)
## # A tibble: 6 x 7
##   `'Timestamp'` `'Lane 1 Occ'` `'Lane 1 Flow'` `'Lane 2 Occ'`
##   <chr>                  <dbl>           <dbl>          <dbl>
## 1 3/14/2003 00…         0.01                14         0.0186
## 2 3/14/2003 00…         0.0133              18         0.025 
## 3 3/14/2003 00…         0.0088              12         0.018 
## 4 3/14/2003 00…         0.0115              16         0.0203
## 5 3/14/2003 00…         0.0069               8         0.0178
## 6 3/14/2003 00…         0.0077              11         0.0151
## # … with 3 more variables: `'Lane 2 Flow'` <dbl>, `'Lane 3 Occ'` <dbl>,
## #   `'Lane 3 Flow'` <dbl>

The classes of the variables are automatically determined by read_delim(). We see that the occupancy variables are numeric (also known as double-precision, or dbl for short). The Timestamp variable is treated as a character vector. In Chapter ??, we consider whether or not these are appropriate data types for analysis and other issues about the organization of the data, e.g., whether or not to keep 3 separate vectors for occupancy or to stack them into one variable.

5.3.3 Functions for Reading Delimited Data

There are many functions available in R for reading data in csv files; these include read.csv(), read.delim() and read.table() that belong to the utilities provided in R (i.e., there is no need to load a separate package). The readr package also provides the similarly named functions read_csv(), read_delim() and read_table(). We do not go into the details of all of the parameters for each of these functions, but note only that they provide very similar functionality. We note that the functions in readr have a consistent set of parameters, and the return value from a readr function call is a tibbles, which is a special case of a data frame that includes additional functionality when, e.g., printing.

5.4 Fixed Width Format

Another common format for tabular data is fixed width, where the value for a variable appears in the same position in each row in the source file. For example, below is a fixed-width version of the simple family from the Chapter 4:

Tom     m777017525.16239TRUE   
Maya    f336412421.50106FALSE    
Joe     m797318524.45884FALSE 
Robert  m476715624.48414FALSE    
Sue     f2761 9818.51492FALSE    
Liz     f336819028.94981TRUE  

Here, the 1st 8 characters in each line contain the person’s name. When, the name is, say, 3 letters long, then the trailing 5 characters are blank. Similarly, the values for whether or not the person is over weight (TRUE or FALSE) appear in the 25th to 29th characters in each row so a TRUE value has a space after the E. Notice how the values for, say, Tom’s, age, height, weight, and bmi, appear as a string of digits with no separators, e.g., 777017525.16239. In order to pull apart this string of digits into 77 for age, 70 for height, 175 for weight, and 25.16239 for bmi, we need to know that age appears in positions 10-11, height in 12-13, weight in 14-16, and BMI in 17-24. Note that when a weight is only 2 digits, e.g., 98, the weight value has a leading blank so that it takes up 3 positions.

Sue     f2761 9818.51492FALSE

5.4.1 Fixed Width Formatted Drug Abuse Warning Network Survey

The DAWN survey data are in the file 34565-0001-Data.txt. We assume from the file’s extension that it is plain text, but we need to examine the file contents or read the accompanying codebook to determine how the values are organized in the file. Below is a single record from this file, which corresponds to one emergency room visit related to substance abuse.

     1 2251082    .9426354082   3 4 1 2201141 2 865 105 1102\
005 1 2 1 2.00-7.00-7.0000-7.0000-7.00001255 105 1142032 4 1\
 1 2.50 5.00 5.0100-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00\
-7.00-7.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7\
.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7\
.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7\
.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  \
-7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7 \
 -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  -7\
-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-\
7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.\
00-7.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.00\
00-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.00\
00-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.00\
00  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7 \
 -7  -7  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7\
  -7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  -7-7-\
7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.0\
0-7.00-7.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-\
7.0000-7.0000-7.0000  -7  -7  -7  -7-7-7-7-7.00-7.00-7.0000-\
7.0000-7.00008 611001

We can see that this record has a fixed-width format. Since there are no delimiters between values, we must specify the positions for the variable values when reading the data into R. These records are very long, e.g. the one record displayed above takes 21 lines of text to be displayed. The data file comes with a 2,356 page codebook that describes the format, which is essential for determining the locations of the variables.

Figure 5.1 shows screenshots for the codebook entries for the SEX and CASETYPE variables. We read in the codebook that the sex of the patient is provided in columns 36-37. We also see that 1 stands for male, 2 for female, and -8 for not documented. Similarly, the type of visit appears at location 1214 and has 8 possible values.

Screenshot of Codebook Entries for the DAWN Survey

Figure 5.1: Screenshot of Codebook Entries for the DAWN Survey

Screenshot of Codebook Entries for the DAWN Survey

Figure 5.1: Screenshot of Codebook Entries for the DAWN Survey

The codebook entry for a variable provides the variable name, definition, location in the row, possible values and their labels, etc. For the sex variable, these are SEX; GENDER; 36-37; 1, 2, and -8; and male, female, and “not documented”, respectively. The counts for each value are provided to help us check whether or not we have correctly read the data.

We skimmed the codebook and selected several variables to extract for further investigation. To do this, we assign the starting position of each variable to a vector, which we call start; and we put the ending positions in end. We create variable names for each of these variables and assign them to varNames, e.g.,

start = c(1, 9, 11, 14, 15, 30, 34, 36, 38, 45, 46, 66, 68, 70, 
         75, 80, 87, 94, 1214, 1215, 1217, 1218, 1219, 1220, 1221)
end = c(6, 10, 13, 14, 29, 33, 35, 37, 39, 45, 47, 67, 69, 74, 79,
        86, 93, 100, 1214, 1216, 1217, 1218, 1219, 1220, 1221)
varNames = c("id", "strata", "psu", "replicate", "wt", 
             "psuframe", "age", "sex", "race", "daypart", 
             "numsubs", "toxtest", "s1", "s2", "s3", "s4", 
             "s5", "s6", "type", "disp", "alc", "nonalc", 
             "pharma", "nonmedpharma", "allabuse")

We use the fwf_positions() function in readr to collect this information into one object that we pass to the read_fwf() function, i.e.,

library(readr)
pos = fwf_positions(start, end, col_names = varNames)

#dawn = read_fwf("data/ICPSR_34565/DS0001/34565-0001-Data.txt",
#                col_positions = pos)

We will explore and check values for these variables later.

5.5 Key-Value Pairs

We mention one other plain text format: the key-value pair. The idea is that the value for a variable in a record is provided along with its variable name, which is called the key. A typical key-value pair appears as age:77, i.e., the delimiter between the key, age and its value 77 is often a colon. Another common delimiter is the equal sign. A typical “record” contains several lines in the text file with each variable on its own line and a blank line between records. For example, below are the key-value pairs for the first two family members.

firstName:Tom 
sex:m 
age:77 
height:70  
weight:175 
bmi:25.16 
overWt:TRUE

firstName:Maya 
sex:f 
age:33 
height:64 
weight:125 
bmi:21.50 
overWt:FALSE

This particular arrangement of key-value pairs is known as dcf, short for Debian control format. It is commonly used to hold properties lists on UNIX-type operating systems. Another arrangement of key-value pairs has all of the key-value pairs for one record appear on one line with delimiters separating these pairs.

We can read the family data in this dcf format into R with the read.dcf() function as follows:

family.dcf = read.dcf("data/family.dcf")

When we examine the first few values of family.dcf and check the class, we find that all of the values are character strings and the structure is a matrix, not a data frame, i.e.,

head(family.dcf)
##      firstName sex age  height weight bmi     overWt 
## [1,] "Tom"     "m" "77" "70"   "175"  "25.16" "TRUE" 
## [2,] "Maya"    "f" "33" "64"   "125"  "21.50" "FALSE"
## [3,] "Joe"     "m" "79" "73"   "185"  "24.46" "FALSE"
## [4,] "Robert"  "m" "47" "67"   "156"  "24.48" "FALSE"
## [5,] "Sue"     "f" "27" "61"   "98"   "18.51" "FALSE"
## [6,] "Liz"     "f" "33" "68"   "190"  "28.95" "TRUE"
class(family.dcf)
## [1] "matrix"

The matrix class is discussed in more detail in Chapter ??. We can convert this matrix into a data frame for statistical analysis, with functions such as as.data.frame() and as.numeric().

5.6 Writing a Data Frame to a Text File

Often we have read data from a text file into a data frame and cleaned and processed these data into a data frame that we want to use in an analysis. We might choose to save this version of the data frame, rather than having to recreate it each time we want to analyze the data. We can save the infant and traffic data frames in the same R data file with

save(infants, traffic, file = "savingMyData.Rda")

Then when we want to work with them again, we load them with

load("savingMyData.Rda")

Alternatively, we can write the data frame to a text file as a data table. This second approach is particularly convenenient for sharing files.

The reciprocal functions write.csv(), write.delim() and write.table() create plain text files for saving data frames. Below we show an example of writing the infants data frame to a CSV file

write.csv(infants, file = "infants.csv", row.names = FALSE)

This file can be easily read into R with read.csv("infants.csv").

5.7 Summary: Reading Tabular Data

Data are often naturally arranged in a table-like format in plain text files. To read the raw data into R, we need to know how the records and their values are stored in the source file. We can address the following questions to ascertain this information.

  • Is the file plain text? We can often determine this by viewing the file contents in a plain text editor. More simply, we can examine the filename extension (csv, txt, fwf, and dcf are examples of plain text filename extensions) or check any available documentation.

  • Can the data be arranged in a table format, i.e., with rows for observations/records and columns for variables?

  • How are the values for the variables distinguished –
    • Are there delimiters between values (e.g., comma-separated values)?
    • Does a value for a variable appear in the same position in each line in the source file (fixed width format)?
    • Are the values supplied in key-value pairs (with the key as the variable name)?
  • Is there any encoding for special characters?

There are many more options available in the functions shown in this chapter to help us read and write text files. For example, we can provide our own variable names when we read data, we can specify whether a variable should be treated as character or as a factor, we can provide a special format for a variable, and more. We have covered only some of the basics in this chapter.