2 Introduction

In this chapter, we introduce the statistical programming language R. We learn how to express computations in the R language, and we take the time now, as we learn the basics, to understand the computational paradigm of the language. Rather than simply provide the nuts and bolts and code templates, we aim to explain the essential pieces of the language and how to think about R’s computational model. If we go beyond a utilitarian approach to the language from the very start of our introduction to R, then we can greatly simplify our programming tasks in the future.

While R is a specific language that you may or may not use extensively in the future, what you learn as we explore R is generally applicable to many different programming languages that you might use. For example, R is very similar to Matlab, and it also shares many of the same concepts as Perl, Python, Java, C, and FORTRAN. Although they are different languages, they share important commonalities that are essential for communicating about computations to others and to the computer.

2.1 Getting Started

We start the R programming environment by launching RStudio or the R graphical user interface (GUI), or by invoking the command R in the shell. If you have not installed R on your computer, you can get help in the first course lab meeting. Once R is running, you should see the prompt, ‘>’, in the console. This is where we type R expressions. Then, when we press the Return/Enter key, R evaluates the expression and prints a value in the console. Try it: enter 1 + 2 at the prompt, i.e.,

1 + 2

Press the Return key, and you should see:

1 + 2
## [1] 3

Then, R presents the prompt again and is ready for us to type another expression in the console.

The expression 1 + 2 is an example of a simple arithmetic computation that R can perform.
We describe these and other types of computations in the next section.
Don’t worry right now about the [1]; we explain what it means in the chapter on vectors.

2.2 Computations and Expressions

R provides an interactive environment where we can give an instruction, such as add 1 and 2, then press the Enter key to have the expression immediately evaluated. The answer is printed at the console, and we can repeat this process. This sequence of actions is called a read-evaluate-print loop, or REPL for short. When we type an expression at the prompt in R’s console and press the Enter key, we indicate that we want R to perform the computation. That is, we ask R to evaluate our expression. Here are examples of three kinds of expressions,

2 + 3
sample(7)
hist(precip)

When the first of these three expressions is evaluated, R returns the value 5, which is printed to the console.
For the second expression, we ask R to provide a random ordering of the integers from 1 through 7, and R returns something like

sample(7)
## [1] 7 4 5 2 1 6 3

The third expression doesn’t print any value to the console. Instead, evaluating this expression yields a plot (here, a histogram of the precip data) as a side effect.

The latter two of these three expressions are function-style expressions. For all three expressions, after the computation has been performed, the loop is complete and the prompt is available for another round of computations.

2.3 Arithmetic Expressions and Order of Operations

With R, we can perform many simple arithmetic expressions, similar to what we can do with a scientific calculator. The following are examples of simple arithmetic expressions:

8 - 9
4 * 5
10 / 3
7 ^ 2
9 %/% 2
11 %% 7

In addition to the basic operations like addition, subtraction, multiplication and division (+, -, *, /), we use ^ for exponentiation, %/% for integer division, and %% for modular arithmetic.
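To make the difference between the three division-related operators concrete, here is a small sketch using the same numbers as above. The last line is our own illustrative check, not part of the original examples: it shows that the integer quotient and the remainder together recover the original number.

```r
10 / 3      # ordinary division keeps the fractional part
## [1] 3.333333
9 %/% 2     # integer division: how many whole 2s fit into 9
## [1] 4
9 %% 2      # the remainder left over after that division
## [1] 1
11 %% 7
## [1] 4
(9 %/% 2) * 2 + 9 %% 2   # quotient * divisor + remainder gives back 9
## [1] 9
```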

Of course, we can combine these arithmetic operations into more complex expressions. For example,

10 ^ 5 - 6 / 3
## [1] 99998

Recall that \(10^5\) is 100 thousand, and when we subtract 6 divided by 3 (or 2), we get the result shown here.

Order of Operations

The following expression,

10 ^ (5 - 6 / 3)
## [1] 1000

returns a value of 1000, not 99998, which is what 10 ^ 5 - 6 / 3 evaluates to. These two expressions are very similar, except for the addition of parentheses. These parentheses change the order of operations, which is why the return value is so different.
As expected, the order in which operations are performed follows the precedence in algebra, i.e., exponentiation, then multiplication and division, followed by addition and subtraction. Operations of equal precedence are carried out from left to right, with the exception of exponentiation, which R groups from right to left. However, parentheses can override this order. The first expression, 10 ^ 5 - 6 / 3, has no parentheses, so the first computation is to raise 10 to the power 5. Next is the division of 6 by 3, and lastly 2 is subtracted from \(10^5\). On the other hand, the second expression places parentheses about 5 - 6 / 3, so these computations are performed first. That is, we divide 6 by 3 and subtract the result from 5 to get 3. The final result is \(10^3\) or 1000.
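A few further expressions, chosen by us for illustration, show how grouping affects the result. Subtraction groups from left to right, whereas in R repeated exponentiation groups from right to left, and exponentiation is applied before a leading minus sign. When in doubt, adding parentheses makes the intended order explicit.

```r
8 - 3 - 2       # left to right: (8 - 3) - 2
## [1] 3
8 - (3 - 2)     # parentheses force the second subtraction first
## [1] 7
2 ^ 3 ^ 2       # right to left: 2 ^ (3 ^ 2)
## [1] 512
(2 ^ 3) ^ 2     # parentheses force the left exponent first
## [1] 64
-2 ^ 2          # exponentiation comes before the minus sign
## [1] -4
(-2) ^ 2
## [1] 4
```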

2.4 Call Expressions

We called two functions in our discussion of expressions; these were sample() and hist(). A function contains code (usually several expressions) that performs a specific task. We provide the function with inputs to use in carrying out this task. For example, the abs() function takes the absolute value of the input provided,

abs(-0.8)
## [1] 0.8

These inputs are called arguments and the output from the computation is the return value. When we provide a function with a particular set of values for its arguments and press the Return key, we say we are calling or invoking the function. For now, we work with R’s many built-in functions. Later, in the chapter on programming, we write our own functions.

We saw already that when we call sample(7), R returns a random permutation of the numbers from 1 to 7, e.g.,

sample(7)
## [1] 5 2 3 7 1 4 6

The input that we provide the sample() function is the largest integer in the sequence \(1, 2, \ldots\) that we want permuted. However, sample() can take more than one input. It has 4 arguments; these are

args(sample)
## function (x, size, replace = FALSE, prob = NULL) 
## NULL

The arguments to sample() have names, x, size, replace, and prob, and 2 of them, replace and prob, have default values of FALSE and NULL, respectively. This means that they are optional, i.e., we don’t need to provide the inputs for these 2 arguments. If we don’t provide them in our function call, then R simply uses the default values. The size argument does not have a default value, and if it is not supplied then R does the sensible thing: it returns a permutation of all of the values in x. For example, if we want to sample only 3 random values from the integers from 1 to 7, then we specify x (as always) and we must also specify the size argument. We do this with

sample(7, 3)
## [1] 7 3 4
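The sketch below illustrates how the defaults behave. Because sample() returns random values, we check properties of the result rather than the values themselves. The helper functions set.seed(), length(), and sort() are not introduced in this chapter; we use them here only as an aid, and the seed value is an arbitrary choice of ours.

```r
set.seed(1)                     # arbitrary seed so the run can be repeated

length(sample(7))               # size omitted: all 7 values are permuted
## [1] 7
sort(sample(7))                 # a permutation contains each value once
## [1] 1 2 3 4 5 6 7
length(sample(7, 3))            # size supplied: only 3 values are drawn
## [1] 3

# Override the default replace = FALSE so that a value may be drawn
# more than once; this lets us draw more values than x contains.
length(sample(7, 10, replace = TRUE))
## [1] 10
```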

There are many ways to specify the arguments to a function call. We consider these later (see sec:pgmInvoke), after we learn more about the syntax of the R language.

2.5 Expression Syntax

An expression is an instruction to the computer. The computer evaluates the expressions that we write and returns a value. In order for the software to carry out our instructions, these directions must obey the grammar of the language.

Grammar

The R software uses blanks, commas, parentheses, algebraic and relational operators, and naming conventions to figure out the various parts of an expression and so determine the computation to perform. Importantly, if we understand how the software reads an expression, then we can more easily identify and correct syntax errors.

In the expression below:

round(abs(5 - 2.5^2), digits = 4)
## [1] 1.25

R uses the parentheses, minus and exponentiation signs, comma, and equal sign to identify the values, functions, and arguments in this expression. Starting from the inner-most expression, the instructions are to square 2.5, and subtract this quantity from 5. Then, take the absolute value of this difference, and round the resulting value to 4 decimal places.
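We can follow this evaluation from the inside out by typing each piece at the prompt. The final line is an extra example of ours: the digits argument of round() controls the number of decimal places kept, which is easier to see with a value that has a long decimal expansion.

```r
2.5^2                               # inner-most piece: square 2.5
## [1] 6.25
5 - 2.5^2                           # subtract the square from 5
## [1] -1.25
abs(5 - 2.5^2)                      # absolute value of the difference
## [1] 1.25
round(abs(5 - 2.5^2), digits = 4)   # 1.25 needs no further rounding
## [1] 1.25
round(2 / 3, digits = 4)            # rounding is visible here
## [1] 0.6667
```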

Readability

We can eliminate the blanks in the expression above and code it as

round(abs(5-2.5^2),digits=4)
## [1] 1.25

R can parse this expression and carry out the instructions, which are the same as in the previous expression. However, we too need to be able to read the expressions that we write, and it’s much easier for us to read code that adopts some of the conventions of written English and places a space after a comma and before and after operators and numbers.

2.6 Variables and Assignment

Previously, we saw that when we provide R with an expression to evaluate, R prints the results to the console as output. But, if we want to save this output for future computations, we can assign the result of the computation to a name, e.g.,

x = 10 ^ 5 - 6 / 3

Now, when we type this expression and hit the return key, R does not print 99998 to the console. Instead, the return value is assigned to a variable named x. That is, the equals sign tells R to assign the result of the computation to the variable x. Now x is a name by which we refer to the value 99998. We can check the value of x by typing x at the prompt and hitting return,

x
## [1] 99998

Here, when we type x at the prompt, the computation that we have asked R to perform is simply to print the value of x to the console. We can also change the value associated with x, by assigning a new value to it. For example,

x = 1 + 3
x
## [1] 4

This is one of two main ways to assign a value to a variable in R. In addition to =, we can also assign a value with <-, i.e.,

x <- 1 + 2
x
## [1] 3

Either of these two forms of assignment can be used. We suggest that you choose one and stick with it. We consistently use = in this book.

Variables allow us to store values without needing to recompute them. Additionally, by storing a value, we reduce redundant calculations, which can help us avoid mistakes. Variables also allow us to write general expressions. For example, the length of the hypotenuse of a right triangle with sides a and b is

sqrt(a^2 + b^2)

Here, we can use this formula over and over for different triangles, by changing the values of a and b and re-evaluating this expression.
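For instance, with side lengths that we have picked for illustration, the same expression gives the hypotenuse of two different triangles once a and b have been assigned values:

```r
a = 3
b = 4
sqrt(a^2 + b^2)     # 3-4-5 right triangle
## [1] 5

a = 5               # assign new side lengths ...
b = 12
sqrt(a^2 + b^2)     # ... and re-evaluate the very same expression
## [1] 13
```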

R uses “copying” semantics in assignments. That is, when we assign the value of x to y, then y gets the value of x, but the variable y is not linked to x. This means that when x is changed, y does not see that change. In the code below, x begins with the value of 3, then R copies this value in the assignment statement so that y has the value of 3. These two variables are unrelated after that so when we assign x the value of 10, y remains unchanged. Below is the code for this simple example,

x
## [1] 3
y = x
x = 10
x
## [1] 10
y
## [1] 3

Again, y continues to have the value 3 after x’s value has changed.
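The same reasoning applies when a variable is assigned the result of a computation that involves another variable. In the sketch below, the names p and q are hypothetical and chosen so as not to disturb x and y. The variable q stores the value that the expression had at the time of assignment, not the expression itself, so we must re-evaluate the assignment if we want q to reflect a new value of p.

```r
p = 2
q = p * 10    # q stores the value 20, not the formula p * 10
p = 5         # changing p afterwards ...
q             # ... does not change q
## [1] 20

q = p * 10    # re-evaluate the assignment to use the new p
q
## [1] 50
```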

2.7 Syntax and Parsing

How does R know what computations to perform? It breaks down an expression into parts, called tokens. From these parts, it can figure out what to do. This is very similar to how we read and understand text. When we read, we use punctuation, such as a period, comma, semicolon, quotation marks, etc., as well as blanks and capitalization, to make sense of what’s written. These conventions help us figure out what the person who wrote the text is saying. For example, the blanks, capitalization and punctuation have been stripped from some text that begins,

hatheads...

Without blanks to identify words, and punctuation to identify sentences, phrases and contractions, we don’t know if this text begins as

Ha! The ad's ...

as in “Ha! The ad’s finished”, or

Hat! Heads ...

as in: “Hat! Heads need to keep warm in winter.” The basic conventions that R uses for parsing code are described below.

2.7.1 Tokens

R breaks an expression up into meaningful pieces, called tokens. Similar to English, R uses blanks and quotation marks to identify tokens. Tokens include arithmetic operations and variable names.
In R, the expressions 2*3+1 and 2 * 3 + 1 are equivalent. The atomic tokens, * and +, let us know that the number 2 is multiplied by 3 and then 1 is added. The blanks are not needed here, but they make it easier for us to read the expression. On the other hand, the expressions sqrt(17) and s qrt(17) are definitely not the same. The blank in the second expression splits the name into two tokens, s and qrt, so R no longer sees a call to sqrt() and instead reports a syntax error. We adopt the convention of placing blanks in expressions to help readability, which we discuss more later (see sec:pgmStyle).
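We can check both claims in code. The sketch below uses identical(), parse(), try(), and inherits(), none of which are introduced in this chapter. The parse() function reads code from text without evaluating it, and try() lets us capture the resulting syntax error rather than stopping. The variable name result is our own.

```r
# Blanks around operators do not change the computation
identical(2*3+1, 2 * 3 + 1)
## [1] TRUE

# A blank inside a name splits it into two tokens, s and qrt,
# which R cannot parse as a single expression
result = try(parse(text = "s qrt(17)"), silent = TRUE)
inherits(result, "try-error")
## [1] TRUE
```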

2.7.2 Naming Conventions

Naming conventions for variables also help R parse expressions. Variable names can contain upper and lower case letters, digits, and some punctuation, but they cannot start with a digit or an underscore. The . and _ are allowed, but most other punctuation characters are not. Also, upper and lower case letters are not the same, so X and x refer to different variables.
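The following sketch tries out these rules. All of the variable names are hypothetical, and parse() and try() are used only to capture the syntax error that an invalid name produces without stopping the session.

```r
total.count = 3      # a period is allowed in a name
total_count2 = 4     # so are an underscore and a trailing digit
total.count + total_count2
## [1] 7

Rate = 0.5           # upper and lower case are different,
rate = 2             # so these are two separate variables
Rate == rate
## [1] FALSE

# A name cannot start with a digit: R reads the number 2 and then
# an unexpected symbol
bad = try(parse(text = "2x = 1"), silent = TRUE)
inherits(bad, "try-error")
## [1] TRUE
```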

2.7.3 New Lines

R uses a new line to parse expressions. We produce a new line when we hit the Return (or Enter) key.
In all of the expressions we have written so far, we end the expression when we press the Enter key. Then, R carries out the calculation and returns the value. However, a new line does not always indicate the end of an expression. We can split an expression over multiple lines in the console.
For example,

 10 ^ 5 -
 6 / 3

When typing at the console, the expression appears as

> 10 ^ 5 -
+ 6 / 3

Notice the + in the second line above. This symbol appears rather than the typical ‘>’ prompt to indicate that the expression on the first line is not complete and R is waiting for us to write the rest of the expression on this continuation line. We can even spread this expression over four lines with

> 10 ^
+ 5 -
+ 6 /
+ 3

Each line ends with an arithmetic operator, so R continues the expression on the next line and looks for the second number for the operation, e.g., what power to raise 10 to and what numbers to subtract and divide by.
This is not a very clearly written expression! However, we may at times have long expressions and breaking them up across lines helps with the readability of the code. Of course, if we try to break up our expression after, say, the 6, then we do not get the expected results:

10 ^ 5 - 6
## [1] 99994

Instead of providing us with a continuation line for us to type / 3, R evaluates 10 ^ 5 - 6. This happens because 10 ^ 5 - 6 is a valid R expression so R evaluates it and returns: 99994. Since we have provided R with a valid expression, R can’t discern that we have not finished writing our expression.

Note that we typically do not include the prompts (> and +) in our code display. We do here only to make clear how R parses these expressions.
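The same rules apply when code is split over lines in a script. In the sketch below, the variable names total and broken are our own, and parse() and try() are used only to capture a syntax error. It shows that ending a line with an operator continues the expression, that starting the next line with the operator does not, and that wrapping the whole expression in parentheses is one way to break a line wherever we like, because R keeps reading until the parenthesis is closed.

```r
# The trailing - tells R that the expression is not finished
total = 10 ^ 5 -
  6 / 3
total
## [1] 99998

# Here the first line is already a complete expression, so the
# second line is read on its own and cannot start with /
broken = try(parse(text = "10 ^ 5 - 6\n/ 3"), silent = TRUE)
inherits(broken, "try-error")
## [1] TRUE

# Inside parentheses, R waits for the closing ) before evaluating
(10 ^ 5 - 6
  / 3)
## [1] 99998
```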

2.7.4 Compound Expressions

We have seen several simple call expressions, e.g., sample(7, 3) and sqrt(16). A compound call expression is like a compound function in algebra, e.g., \(f(g(x))\). Recall from algebra that the return value from evaluating \(g(x)\) is passed to \(f\) as input to that function. For an example in R, say we want the integer part of the square root of x. Then, we can pass the return value of sqrt(x) to the floor() function; that is,

floor(sqrt(x))
## [1] 3

(Recall that x has the value 10.) We can make compound expressions with functions that use more than one argument, e.g., round(sqrt(10), digits = 2) returns 3.16.
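To see that the inner call is evaluated first, we can carry out the two steps separately. The intermediate variable inner is a hypothetical name used only for this illustration.

```r
x = 10
inner = sqrt(x)     # step 1: evaluate the inner call
inner
## [1] 3.162278
floor(inner)        # step 2: pass its return value to floor()
## [1] 3
floor(sqrt(x))      # the compound expression does both at once
## [1] 3
```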

2.7.5 Ill-formed Expressions

An ill-formed expression is one that R cannot properly parse. For example,

floor(sqrt(x)]

returns the message

Error: unexpected ']' in "floor(sqrt(x)]"

Here, we have no return value from the computation because R can’t parse our expression. Do you see the mistake? The error message indicates that the right square bracket is unexpected. What does R expect? A right parenthesis. If we understand how R parses expressions, then that can help us figure out our mistakes and easily correct them. See whether you can spot the errors in the following expressions:

round(sqrt(10)), digits = 2)

Error: unexpected ',' in "round(sqrt(10)),"


round(sqrt(10, digits = 2))

Error in sqrt(10, digits = 2) : 
  2 arguments passed to 'sqrt' which requires 1

Can you understand what the error messages tell us about these ill-formed expressions?
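For comparison, here is one plausible repair of these expressions, assuming the intent in each case was the compound call used earlier in this section. In the round() examples, the right parenthesis that closes sqrt() must come immediately after the 10, so that digits = 2 is passed to round() rather than to sqrt().

```r
x = 10
floor(sqrt(x))                 # ) rather than ] closes the call
## [1] 3
round(sqrt(10), digits = 2)    # one ) after 10, one at the very end
## [1] 3.16
```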

2.8 Review: Types of Expressions

In the examples below, x contains the value 10.

Arithmetic: 2 + x^2

Arithmetic expressions are composed with the typical operators, e.g., +, -, *, and /, for addition, subtraction, multiplication, and division. The order of evaluation follows the rules of precedence in algebra, and parentheses can override this order. This particular expression evaluates to 102.

Relational: x > 2

Relational expressions use operators such as ‘<’, ‘<=’, and ‘==’ for less than, less than or equal to, and equal to, respectively.
These expressions evaluate to ‘TRUE’ or ‘FALSE’. This particular expression returns ‘TRUE’.

Boolean: x > 20 | x == 0

Boolean expressions operate on logical values with operators such as & and | and ! for and, or, and not, respectively. This particular expression evaluates to FALSE.

Assignment: y = 2 + x^2

The return value from a computation can be assigned to a variable. This variable can then be used in other expressions. Here y is 102.

Call: sqrt(x)

Expressions can invoke, or call, functions, e.g., this expression computes the square root of x. Functions can have multiple arguments; some arguments may be required, meaning that we must provide a value for them when we call the function. Other arguments may be optional; a default value is provided by the function, and we can override this default in the call. For example, the round() function has an argument named x that has no default value and an argument digits, which has a default of 0. We can assign sqrt(x) to the variable z and then round the value in z to 1 decimal place with round(z, digits = 1). The return value is 3.2. See sec:pgmInvoke for more details on how to specify parameter values in a function call.

Compound: round(sqrt(abs(x - y + 5)), 1)

Function calls can be nested, i.e., we can compose functions as in algebra. Here we have nested an arithmetic expression within 3 function calls. The first computation is x - y + 5, which returns -87. Then, this is passed into abs(), which returns 87, and 87 is passed to sqrt() which returns about 9.327, and lastly, this result is the input to the round() function.
The second argument to round() is 1 so 9.3 is returned.
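The review examples can be run in sequence as follows. This is simply the expressions above collected in one place, with the compound expression also evaluated piece by piece so that each intermediate value is visible.

```r
x = 10

2 + x^2                          # arithmetic
## [1] 102
x > 2                            # relational
## [1] TRUE
x > 20 | x == 0                  # Boolean: both sides are FALSE
## [1] FALSE
y = 2 + x^2                      # assignment: nothing is printed
z = sqrt(x)
round(z, digits = 1)             # call with an optional argument
## [1] 3.2

# The compound expression, from the inside out
x - y + 5
## [1] -87
abs(x - y + 5)
## [1] 87
sqrt(abs(x - y + 5))
## [1] 9.327379
round(sqrt(abs(x - y + 5)), 1)
## [1] 9.3
```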