Reading Data into R

1  Reading Data into R

Let's review the methods of reading data into R:
  1. scan - Reads vectors of data which all have the same mode, for example all numbers or all character strings. Elements must be separated from one another by one or more blanks, or the sep= argument can be used to specify a different separator. If the first argument to scan is a filename, it reads from that file; if the first argument is omitted, it reads data from the terminal until it encounters a completely blank line. By default, scan expects to see numbers; if you're using it to read strings, use the what='' argument.
    For example, scan can be used to create a vector of color names, as an alternative to using the c function:
    > mycolors = scan(,what='')
    1: red blue yellow green orange
    6: 
    Read 5 items
    > mycolors
    [1] "red"    "blue"   "yellow" "green"  "orange"
    
    
    scan can also be used to create numeric matrices, by passing the call to scan to the matrix function. Suppose the file mat.dat looks like this:
    7 19 12 15
    8 9 17 4
    52 12 9 7
    12 9 40 13
    
    
    We could read this matrix into R as follows:
    > mymat = matrix(scan('mat.dat'),ncol=4,byrow=TRUE)
    Read 16 items
    > mymat
         [,1] [,2] [,3] [,4]
    [1,]    7   19   12   15
    [2,]    8    9   17    4
    [3,]   52   12    9    7
    [4,]   12    9   40   13
    
    
    Note the use of byrow=TRUE to override R's default of filling matrices by columns.
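    The sep= and what= arguments described above can be tried without a file or the terminal. The following is a minimal sketch, assuming scan's text= argument is used to supply the input inline; the sample values are made up for illustration.

    ```r
    # Comma-separated strings: what = "" tells scan to expect character data
    colors = scan(text = "red,blue,yellow,green,orange",
                  what = "", sep = ",", quiet = TRUE)

    # Semicolon-separated numbers: the default what= expects numeric input
    nums = scan(text = "7;19;12;15", sep = ";", quiet = TRUE)

    print(colors)
    print(nums)
    ```

    Setting quiet=TRUE only suppresses the "Read n items" message; the values returned are the same.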
  2. read.table, read.delim, read.csv - These functions all create data frames, where different columns (variables) can be of different modes. read.table is the basic function; read.csv is a wrapper that sets the separator to a comma and assumes that the first line of the file contains the variable names (header=TRUE). read.delim is similar to read.csv, but uses a tab as the separator (sep='\t'). The first, and only required, argument to these functions is a filename or URL. The other potentially useful arguments are described on the help page (type "?read.table" in R).
    If you have problems reading your data, the count.fields function can be useful in finding which lines have the wrong number of values. For example, recall the world data set that we've used previously. If we wanted to check that there were the same number of fields in each line, we could do the following:
    > flds = count.fields('http://www.stat.berkeley.edu/~spector/s133/data/world.txt',sep=',')
    > table(flds)
    flds
      5 
    155 
    
    
    This shows that all 155 records have 5 comma-separated fields. If there were records that didn't match the others, the which function could be used to find their line numbers.
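    To illustrate how which can locate mismatched records, here is a small sketch. It assumes a temporary file with made-up contents in which the third line is deliberately short; the file and its values are hypothetical and are not part of the world data set.

    ```r
    # Hypothetical sample file: every line should have 5 fields, but line 3 has 3
    tmp = tempfile(fileext = ".csv")
    writeLines(c("a,b,c,d,e",
                 "r1,1,2,3,4",
                 "r2,5,6",
                 "r3,7,8,9,10"), tmp)

    flds = count.fields(tmp, sep = ",")
    print(table(flds))

    # Line numbers whose field count differs from the expected 5
    bad = which(flds != 5)
    print(bad)

    unlink(tmp)   # remove the temporary file
    ```

    The values returned by which are positions in flds, which correspond to line numbers in the file as long as no lines were skipped when counting.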
  3. readLines - The previous functions all try to extract pieces of the input data into separate variables, and will always be the main tools for creating vectors, matrices, and data frames. But if we're just processing text, say, from a web page or from the output of some other computer program, the data won't be arranged in the way that scan or read.table need. In cases like this, we simply want to read the text into R so that we can manipulate it to suit our needs. The R function that does this is readLines. The readLines function accepts a URL or file as its first argument, and it returns a character vector with as many elements as there are lines in the input source. Unlike scan or read.table, readLines makes no attempt to split the input into fields; it simply returns each line as a character string. So whenever we're confronted with "unorganized" data, readLines is the function that will read it into R so that we can manipulate it further. (There's also a similar function called readline that will read exactly one line from standard input.)
    One sometimes useful argument to pass to readLines is the n= argument. With its default of -1, all of the lines in the input will be read; if you provide a number that's greater than 0, readLines will read in only that many lines.
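    The effect of n= can be seen in the following short sketch, which assumes a temporary file holding three made-up lines of text:

    ```r
    # Hypothetical input: a temporary file with three lines
    tmp = tempfile()
    writeLines(c("first line", "second line", "third line"), tmp)

    all_lines = readLines(tmp)          # default n = -1 reads every line
    first_two = readLines(tmp, n = 2)   # reads only the first two lines

    print(length(all_lines))
    print(first_two)

    unlink(tmp)   # remove the temporary file
    ```

    Reading just a few lines this way is a convenient check on the layout of a large file before reading all of it.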
  4. Connections - In addition to reading from files, R can read from a more general set of data sources known as connections. (Type "?connections" in R to learn about all the possibilities.) One very handy type of connection is the textConnection, which lets R read text that you paste into a session just as if it came from a file. For example, recall the previous example of reading a file with scan to create a matrix. We can reproduce that example without needing a file, by cutting and pasting the data into a call to textConnection:
    > thedata = textConnection('7 19 12 15
    + 8 9 17 4
    + 52 12 9 7
    + 12 9 40 13
    + ')
    > mymat = matrix(scan(thedata),ncol=4,byrow=TRUE)
    Read 16 items
    > mymat
         [,1] [,2] [,3] [,4]
    [1,]    7   19   12   15
    [2,]    8    9   17    4
    [3,]   52   12    9    7
    [4,]   12    9   40   13
    
    
    This makes it very easy to run examples that you may find online, and saves the trouble of creating a file when you just need to read in a small amount of data.
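    A textConnection can be passed to the data frame functions in the same way. The sketch below assumes a small comma-separated table whose column names and values are invented for illustration, and reads it with read.csv:

    ```r
    # Made-up data pasted inline; the first line supplies the variable names
    con = textConnection("name,score
    alice,90
    bob,85")

    df = read.csv(con, strip.white = TRUE)  # strip.white drops any indentation left from pasting
    close(con)                              # the connection was created open, so close it explicitly

    print(df)
    ```

    The connection is closed by hand here on the assumption that read.csv leaves an already-open connection open.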



File translated from TEX by TTH, version 3.67.
On 8 Feb 2010, 15:12.