1 Software for Statisticians

R is an interactive, interpreted language designed by statisticians for statisticians. Interactivity is a very useful feature for statisticians. When we work with data, we often want to visualize the data, look at numerical summaries, examine the output from fitting a model, and then decide what to do next. This process is called Exploratory Data Analysis, or EDA for short (see the section on EDA). It is a highly iterative process in which we let the data direct what we do next. We try different things as we follow different branches or paths. Sometimes these lead to useful insights that we want to report. At other times, they verify that certain assumptions are justified, or they suggest trying different methods to better understand the data. The ability to dynamically specify what we want to do next is important. R also allows us to combine commands into a script or “program” that we can re-run on new or different data to recreate our analyses. Running a script is often termed batch programming because we execute several commands in a single run. This combination of interactive commands and batch scripts gives us the best of both worlds: we use the interactive facilities during exploration, and the programming facilities when the exploration is more “complete”.

When we say R is an interpreted language, we mean that we can give an instruction and immediately have it evaluated. Then, we can give another command. In non-interpreted languages, we must write an entire program made up of a sequence of commands before we run the code. We have to order the instructions and take account of different possibilities. Once the program is running, we cannot change the commands. All we can do is either wait for it to complete or terminate it and re-run it with the commands altered or different inputs.

There are several features of the R language that are specially designed for how statisticians work with and think about data. Statisticians typically want to work on groups of observations or experimental units, such as our family of 14 individuals that we have used as an example throughout this chapter. Vectors and data frames are natural extensions of this notion. The vector height represents the heights of the family members and we typically want to operate on the entire vector of heights to, e.g., convert the values from inches to centimeters, find the average height, or plot the distribution of height. The philosophy that operations work on an entire vector means we (the users) don’t have to write loops for many operations. Vectorized operations are very convenient.
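As a minimal sketch of the vectorized operations described above, the following uses a short vector of made-up heights (the values are illustrative assumptions, not the chapter's family data) and compares a vectorized conversion with the explicit loop it replaces.

```r
# Hypothetical heights in inches (illustrative values, not the chapter's data)
height <- c(70, 64, 68, 61, 72)

# Vectorized arithmetic: every element is converted without an explicit loop
height_cm <- height * 2.54

# Summary functions also operate on the whole vector at once
mean_height <- mean(height)

# The equivalent explicit loop, shown only for comparison
height_cm_loop <- numeric(length(height))
for (i in seq_along(height)) {
  height_cm_loop[i] <- height[i] * 2.54
}

# Both approaches produce the same values
all.equal(height_cm, height_cm_loop)

# hist(height) would plot the distribution of the heights
```

The loop and the vectorized expression agree, but the vectorized form is shorter and states the intent directly.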

The data frame is another example of a design feature for statisticians. Although it might appear to be the same as a matrix, it is not. A matrix is rectangular in shape, and all of its elements must be the same primitive type. A data frame is also rectangular, but its columns are vectors that can be of different types. The data frame is very convenient for working with, e.g., various measurements on a collection of individuals. Our family data frame represents a variety of measurements/variables (name, sex, height, weight, BMI, etc.) on 14 individuals. We have seen already that these variables are not all numeric, e.g., the names are strings and whether or not someone is overweight is a logical.
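The difference between a data frame and a matrix can be illustrated with a small sketch. The data frame `family_small` and its values are hypothetical stand-ins for the chapter's family data.

```r
# Hypothetical measurements on three individuals (illustrative values only)
family_small <- data.frame(
  name = c("Ann", "Bob", "Cy"),
  height = c(64, 70, 68),
  overweight = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE  # keep names as character strings in all R versions
)

# Each column of the data frame keeps its own type
col_types <- sapply(family_small, class)
col_types

# Converting to a matrix forces a single common type, here character
as_mat <- as.matrix(family_small)
typeof(as_mat)
```

In the matrix, the numeric heights and logical values have been converted to strings, which is why the data frame is the better container for mixed measurements.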

One of these data types is the factor. As noted earlier in this chapter, this type is somewhat unusual because the values are stored as integers with character labels, and we cannot perform arithmetic operations on them. This data type is designed for representing qualitative information such as nominal and ordinal variables. Examples of these are sex, marital status, and income class. Statisticians typically analyze qualitative data differently from quantitative variables. For example, it doesn’t make sense to find the average of marital status. Instead, we want the number or the proportion of observations of each type of status. Many functions in R operate differently on a factor variable than on a numeric variable. One example is the summary() function, which produces counts for each level of a factor variable but the minimum, quartiles, median, mean, and maximum for numeric data.
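The behavior of factors described above can be sketched as follows. The vectors `status`, `weights`, and `income` are hypothetical examples, not variables from the chapter's data.

```r
# Hypothetical marital status values (illustrative only)
status <- factor(c("single", "married", "married",
                   "divorced", "single", "married"))

# Internally the values are integer codes that index the character labels
codes <- as.integer(status)
levels(status)

# summary() on a factor returns the count for each level
status_counts <- summary(status)
status_counts

# Proportions of observations at each level
status_props <- prop.table(table(status))

# summary() on a numeric vector returns numerical summaries instead
weights <- c(150, 182, 135, 201, 168, 174)
summary(weights)

# An ordered factor represents an ordinal variable and supports comparisons
income <- factor(c("low", "high", "medium", "low"),
                 levels = c("low", "medium", "high"), ordered = TRUE)
income[1] < income[2]
```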

A lot of what we do in statistics and exploratory data analysis is to look at subgroups of a sample or population. We determine characteristics of a subgroup and compare them to those of other groups or to the same characteristic of the overall group. Since being able to easily extract subgroups from our data is so important to statisticians, R offers several different ways to specify subsets of data. In this chapter we saw five different ways to compute a subset (logical, position, exclusion, name, and all). Additionally, there are specialized functions for taking a subset, such as subset() in base R and filter() and slice() in the dplyr package.
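The five ways of subsetting listed above can be sketched on a small named vector. The names and values here are hypothetical, and only base R is used; the dplyr call is mentioned in a comment rather than run, since it assumes that package is installed.

```r
# Hypothetical named vector of heights in inches (illustrative values only)
height <- c(Ann = 64, Bob = 70, Cy = 68, Dee = 61)

by_logical   <- height[height > 65]       # logical: keep elements where TRUE
by_position  <- height[c(1, 3)]           # position: keep elements 1 and 3
by_exclusion <- height[-2]                # exclusion: drop element 2
by_name      <- height[c("Bob", "Dee")]   # name: select by element names
everything   <- height[]                  # all: empty index keeps every element

# subset() from base R applies a logical condition to the rows of a data frame
people <- data.frame(name = names(height), height = unname(height),
                     stringsAsFactors = FALSE)
tall <- subset(people, height > 65)
tall

# With dplyr installed, dplyr::filter(people, height > 65) selects the same rows
```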

When working with experimental units and designs, we often need to create variables that have repeated patterns of values. The seq() and rep() functions (as well as others such as expand.grid()) can be very useful in this regard.
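As a brief sketch of these functions, the following builds a few patterned vectors and a small crossed design. The factor names `dose` and `treatment` and their levels are hypothetical.

```r
# Regular sequences
by_three <- seq(1, 10, by = 3)          # 1, 4, 7, 10
unit_grid <- seq(0, 1, length.out = 5)  # 0, 0.25, 0.5, 0.75, 1

# Repeated patterns: whole vector repeated versus each element repeated
alternating <- rep(c("control", "treatment"), times = 3)
blocked <- rep(c("control", "treatment"), each = 3)

# All combinations of two hypothetical design factors
design <- expand.grid(dose = c(10, 20, 30),
                      treatment = c("control", "treatment"))
design
```

Note that expand.grid() varies the first factor fastest, so the three doses cycle within each treatment level.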

There are many other features in the language that have been designed with data analysis in mind, e.g., high-quality graphics and more complex data structures.

1.1 Summary: Why R?

R and its predecessor S were developed by statisticians for data analysis and have many features that help in this process, including

Interactivity: Many statisticians prefer investigating data in an interactive and flexible fashion, where a plot or the results from an analysis suggest new paths to take.
Vectorized: The vector is a convenient structure for representing a variable. The vector contains measurements on a set of subjects. Similarly, the data frame is a convenient data structure for holding variables from a study. A data frame is an ordered collection of vectors (variables) where the first value in each vector is the measurement for the first subject, and so on.
Data Types: Special data types have been created for nominal (factor) and ordinal (ordered factor) data, and many functions analyze data differently, depending on whether the input is numeric or a factor.
Subsetting: The rich set of methods for taking subsets of data structures enables statisticians to create, examine, and compare subgroups of observations.
Creating Vectors: The methods for creating vectors from sequences and repeated values offer a flexible approach for setting up, e.g., design variables for model fitting.
Programming: The ability to write code that uses control flow and to create functions helps statisticians develop statistical methodology and tailor their data analysis.
Sound Algorithms: Core statistical methods use computationally sound algorithms to reduce numerical errors and computational limitations.
Advanced Methods: Statistical researchers contribute packages that implement their latest statistical methodologies for others to use.
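
The Programming item above can be illustrated with a minimal sketch of a user-defined function that uses control flow. The function name `bmi_category`, the pounds-and-inches BMI formula, and the cutoff of 25 are assumptions made for illustration; the chapter's own definition of overweight may differ.

```r
# Hypothetical helper: classify individuals from weight (pounds) and height (inches).
# Assumes BMI = 703 * weight / height^2 and an illustrative cutoff of 25.
bmi_category <- function(weight_lb, height_in, cutoff = 25) {
  # Control flow: stop early on invalid input
  if (any(height_in <= 0)) {
    stop("height_in must be positive")
  }
  bmi <- 703 * weight_lb / height_in^2
  # Vectorized conditional: one label per individual
  ifelse(bmi >= cutoff, "overweight", "not overweight")
}

# Usage on two made-up individuals
labels <- bmi_category(c(150, 220), c(70, 68))
labels
```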