Functions

1 Functions

When you call a function, R creates a separate environment for it, and the variables you create inside the function exist only in that environment; when you return to where you called the function from, those variables no longer exist. You can refer to other objects that are in the calling environment, but if you make any changes to them, the changes take place only in the function environment. To get information back to the calling environment, you must pass a return value, which becomes the value of the function call. R automatically returns the value of the last expression evaluated in your function (typically a final statement that isn't assigned to a variable), or you can place the object you want to return in a call to the return function. You can return only a single object from a function in R; if you need to return multiple objects, return a list containing them, and extract them from the list once you are back in the calling environment.
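The following is a minimal sketch of these rules, not part of the original example set; the names x, sortit and minmax are made up for illustration. It shows that an assignment inside a function leaves the caller's object alone, and that a list can carry several values back.

```r
# Hypothetical example: assignments inside a function stay inside it
x = c(3, 1, 2)
sortit = function() {
    x = sort(x)      # creates a local x; the x outside is untouched
    x                # last value, not assigned, so it is returned
}
sorted = sortit()
x                    # still 3 1 2
sorted               # 1 2 3

# Returning several objects at once by wrapping them in a list
minmax = function(v) list(lo = min(v), hi = max(v))
res = minmax(x)
res$lo               # 1
res$hi               # 3
```

Extracting the pieces with $ (or [[ ]]) is the usual way to unpack such a list in the calling environment.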

As a simple example of a function that returns a value, suppose we want to calculate the ratio of the maximum value of a vector to the minimum value of the vector. Here's a function definition that will do the job:

maxminratio = function(x)max(x)/min(x)

Notice that for a single-line function you don't need to use braces ({}) around the function body, but you are free to do so if you like. Since the final statement wasn't assigned to a variable, it will be used as the return value when the function is called. Alternatively, the value could be placed in a call to the return function. If we wanted to find the max to min ratio for all the columns of a matrix called mymat, we could use our function with the apply function:

apply(mymat,2,maxminratio)

The 2 in the call to apply tells it to operate on the columns of the matrix; a 1 would be used to work on the rows.
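Since mymat isn't defined in these notes, here is a small sketch with a made-up matrix (the values are an assumption chosen to make the ratios easy to check by hand):

```r
maxminratio = function(x)max(x)/min(x)

# Hypothetical 3 x 2 matrix; matrix() fills column by column,
# so the columns are (1,2,4) and (5,10,20)
mymat = matrix(c(1, 2, 4,
                 5, 10, 20), ncol = 2)

apply(mymat,2,maxminratio)   # one ratio per column: 4 4
apply(mymat,1,maxminratio)   # one ratio per row:    5 5 5
```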

Before we leave this example, it should be pointed out that this function has a weakness - what if we pass it a vector that has missing values? Since we're calling min and max without the na.rm=TRUE argument, we'll always get a missing value if our input data has any missing values. One way to solve the problem is to just put the na.rm=TRUE argument into the calls to min and max. A better way would be to create a new argument with a default value. That way, we still only have to pass one argument to our function, but we can modify the na.rm= argument if we need to.

maxminratio = function(x,na.rm=TRUE)max(x,na.rm=na.rm)/min(x,na.rm=na.rm)

If you look at the function definitions for functions in R, you'll see that many of them use this method of setting defaults in the argument list.
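As a quick check of the default argument, here is a sketch using a made-up vector v that contains one missing value:

```r
maxminratio = function(x,na.rm=TRUE)max(x,na.rm=na.rm)/min(x,na.rm=na.rm)

v = c(2, NA, 8, 4)              # hypothetical data with a missing value
maxminratio(v)                  # default na.rm=TRUE drops the NA: 8/2 = 4
maxminratio(v,na.rm=FALSE)      # overriding the default gives NA
formals(maxminratio)$na.rm      # the default stored in the argument list: TRUE
```

The caller still passes only one argument in the common case, but keeps the option of overriding the default by name.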

As your functions get longer and more complex, it becomes more difficult to simply type them into an interactive R session. To make it easy to edit functions, R provides the edit command, which will open an editor appropriate to your operating system. When you close the editor, the edit function will return the edited copy of your function, so it's important to remember to assign the return value from edit to the function's name. If you've already defined a function, you can edit it by simply passing it to edit, as in

maxminratio = edit(maxminratio)

You may also want to consider the fix function, which automates the process slightly.

To start from scratch, you can use a call to edit like this:

newfunction = edit(function(){})

Suppose we want to write a function that will allow us to calculate the mean of all the appropriate columns of a data frame, broken down by a grouping variable, and including the counts for the grouping variable in the output. When you're working on developing a function, it's usually easier to solve the problem with a sample data set, and then generalize it to a function. We'll use the movies data frame as an example, with both weekday and month as potential grouping variables. First, let's go over the steps to create the movies data frame with both grouping variables:

> movies = read.delim('http://www.stat.berkeley.edu/classes/s133/data/movies.txt',
+ sep='|',stringsAsFactors=FALSE)
> movies$box = as.numeric(sub('\\$','',movies$box))
> movies$date = as.Date(movies$date,'%B %d, %Y')
> movies$weekday = weekdays(movies$date)
> movies$weekday = factor(weekdays(movies$date),
+    levels = c('Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'))
> movies$month = months(movies$date)
> movies$month = factor(months(movies$date),levels=c('January','February','March',
+          'April','May','June','July','August','September','October','November','December'))

Since I've done a fair amount of processing to this data set, and since I'm going to want to use it later for testing my function, I'm going to use the save function to write a copy of the data frame to a file. This function writes out R objects in R's internal format, just like the workspace is saved at the end of an R session. You can also transfer a file produced by save to a different computer, because R uses the same format for its saved objects on all operating systems. Since save accepts a variable number of arguments, we need to specify the file= argument when we call it:

> save(movies,file='movies.rda')

You can use whatever extension you want, but .rda or .RData are common choices.
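To see the round trip without touching the movies data, here is a minimal sketch that saves a small made-up data frame (scores is hypothetical) to a temporary file and restores it with load:

```r
scores = data.frame(id = 1:3, score = c(90, 85, 77))   # hypothetical data

f = tempfile(fileext = '.rda')   # temporary file, so nothing is left behind
save(scores,file=f)

rm(scores)                       # the object is gone from the workspace ...
loaded = load(f)                 # ... and load brings it back under its original name
loaded                           # load returns the names of the restored objects
exists('scores')                 # TRUE
```

Note that load restores objects under the names they were saved with, so you don't assign its result to get the data back.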

It's often useful to break down the steps of a problem like this, and solve each one before going on to the next. Here are the steps we'll need to go through to create our function:

1. Find the appropriate columns of the data frame for the aggregate function.
2. Write the call to the aggregate function that will give us the mean for each group.
3. Write the call to the table function to get the counts, and convert the result to a data frame.
4. Merge together the results from aggregate and table to give us our result.

To find the appropriate variables, we can examine the class and mode of each column of our data frame:

> sapply(movies,class)
       rank        name         box        date     weekday       month 
  "integer" "character"   "numeric"      "Date"    "factor"    "factor" 
> sapply(movies,mode)
       rank        name         box        date     weekday       month 
  "numeric" "character"   "numeric"   "numeric"   "numeric"   "numeric"

For this data frame, the appropriate variables for aggregation would be rank and box, so we have to come up with some logic that would select only those columns. One easy way is to select those columns whose class is either numeric or integer. We can use the | operator which represents a logical "or" to create a logical vector that will let us select the columns we want. (There's also the & operator which is used to test for a logical "and".)

> classes = sapply(movies,class)
> numcols = classes == 'integer' | classes == 'numeric'

While this will certainly work, R provides an operator that makes expressions like this easier to write. The %in% operator allows us to test for equality to more than one value at a time, without having to do multiple tests. In this example we can use it as follows:

> numcols = sapply(movies,class) %in% c('integer','numeric')
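The two approaches can be compared on a small made-up data frame (df and its columns are assumptions for illustration only); this sketch checks that they select the same columns:

```r
# Hypothetical data frame with an integer, a numeric and a character column
df = data.frame(n = 1:3, x = c(1.5, 2.5, 3.5), s = c('a','b','c'),
                stringsAsFactors = FALSE)

classes = sapply(df,class)
byor = classes == 'integer' | classes == 'numeric'   # two tests joined by "or"
byin = classes %in% c('integer','numeric')           # one test against a set

all(byor == byin)     # TRUE: both pick the same columns
names(df)[byin]       # "n" "x"
```

One small difference: the | version keeps the column names on the logical vector, while %in% returns an unnamed vector. Either works for subscripting.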

Now we need to write a call to the aggregate function that will find the means for each variable based on a grouping variable. To develop the appropriate call, we'll use weekday as a grouping variable:

> result = aggregate(movies[,numcols],movies['weekday'],mean)
> result
    weekday     rank       box
1    Monday 354.5000 148.04620
2   Tuesday 498.9545 110.42391
3 Wednesday 427.1863 139.38540
4  Thursday 493.7692 117.89700
5    Friday 520.2413 112.44878
6  Saturday 577.5714  91.18714
7    Sunday 338.1818 140.45618

Similarly, we need to create a data frame of counts that can be merged with the result of aggregate:

> counts = as.data.frame(table(movies['weekday']))
> counts
       Var1 Freq
1    Monday   10
2   Tuesday   22
3 Wednesday  161
4  Thursday   39
5    Friday  750
6  Saturday    7
7    Sunday   11

Unfortunately, this doesn't name the first column appropriately for the merge function. The best way to solve this problem is to change the name of the first column of the counts data frame to the name of the grouping variable. Recall that using the sort=FALSE argument to merge will retain the order of the grouping variable that we specified with the levels= argument to factor:

> names(counts)[1] = 'weekday'
> merge(counts,result,sort=FALSE)
    weekday Freq     rank       box
1    Monday   10 354.5000 148.04620
2   Tuesday   22 498.9545 110.42391
3 Wednesday  161 427.1863 139.38540
4  Thursday   39 493.7692 117.89700
5    Friday  750 520.2413 112.44878
6  Saturday    7 577.5714  91.18714
7    Sunday   11 338.1818 140.45618

This gives us exactly the result we want, with the columns labeled appropriately.

To convert this to a function, let's put together all the steps we just performed:

> load('movies.rda')
> numcols = sapply(movies,class) %in% c('integer','numeric')
> result = aggregate(movies[,numcols],movies['weekday'],mean)
> counts = as.data.frame(table(movies['weekday']))
> names(counts)[1] = 'weekday'
> merge(counts,result,sort=FALSE)
    weekday Freq     rank       box
1    Monday   10 354.5000 148.04620
2   Tuesday   22 498.9545 110.42391
3 Wednesday  161 427.1863 139.38540
4  Thursday   39 493.7692 117.89700
5    Friday  750 520.2413 112.44878
6  Saturday    7 577.5714  91.18714
7    Sunday   11 338.1818 140.45618

To convert these steps into a function that we could use with any data frame, we need to identify the parts of these statements that would change with different data. In this case, there are two things that we'd have to change: movies, which represents the data frame we're using, and 'weekday', which represents the grouping variable we're using. Here's a function that will perform these operations for any combination of data frame and grouping variable. (I'll change movies to df and 'weekday' to grp to make the names more general, and name the function aggall.)

> aggall = function(df,grp){
+    numcols = sapply(df,class) %in% c('integer','numeric')
+    result = aggregate(df[,numcols],df[grp],mean)
+    counts = as.data.frame(table(df[grp]))
+    names(counts)[1] = grp
+    merge(counts,result,sort=FALSE)
+ }

I'm taking advantage of the fact that R functions will return the result of the last statement in the function that's not assigned to a variable, which in this case is the result of the merge function. Alternatively, I could assign the result of the merge function to a variable and use the return function to pass back the result. At this point it would be a good idea to copy our function into a text file so that we can re-enter it into our R session whenever we need it. You could copy the definition from the console and remove the > and + prompts, or you could use the history command to cut and paste the function without the prompts. (Other ways of saving the text of the function include the dput and dump functions, or simply making sure you save the workspace before quitting R.) I'll call the text file that I paste the definition into aggall.r.
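As a sketch of two of the alternatives just mentioned, the version below uses an explicit return, and dump writes the definition to a file that source can read back. The variable name merged is made up, and the file is written to a temporary directory here rather than the working directory.

```r
aggall = function(df,grp){
   numcols = sapply(df,class) %in% c('integer','numeric')
   result = aggregate(df[,numcols],df[grp],mean)
   counts = as.data.frame(table(df[grp]))
   names(counts)[1] = grp
   merged = merge(counts,result,sort=FALSE)
   return(merged)                 # explicit alternative to a bare last line
}

f = file.path(tempdir(),'aggall.r')
dump('aggall',file=f)             # writes the definition as an assignment
rm(aggall)
source(f)                         # re-creates aggall from the text file
is.function(aggall)               # TRUE
```

dump takes the name of the object as a character string and writes an assignment, which is why source can restore it directly; dput writes only the object itself, so its output is read back with dget and assigned by hand.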

Now that we have the function written, we need to test it. It's often a good idea to test your functions in a "fresh" R session. We can get the movies data frame back into our workspace by using the load command, passing it the name of the file that we created earlier with the save command, and we can use the source function to read back the definition of the aggall function:

> rm(list=objects())  # removes everything from your workspace !!
> source('aggall.r')
> load('movies.rda')

The first test would be to use the movies data frame, but with a different grouping variable:

> aggall(movies,'month')
       month Freq     rank       box
1    January   40 605.0750  92.41378
2   February   55 621.5818  90.98182
3      March   63 516.1111 107.73638
4      April   47 712.9362  77.45891
5        May   95 338.7474 168.77575
6       June  157 449.3885 125.98888
7       July  121 454.4298 129.67063
8     August   74 562.4865 100.53996
9  September   31 553.2581  99.29284
10   October   54 623.6667  86.12557
11  November  111 441.3423 124.22192
12  December  158 507.9051 117.10242

It seems to work well.

But the real test is to use the function on a different data frame, like the world1 data frame. Let's recreate it:

> world = read.csv('http://www.stat.berkeley.edu/classes/s133/data/world.txt',header=TRUE)
> conts = read.csv('http://www.stat.berkeley.edu/classes/s133/data/conts.txt',na.strings='.')
> world1 = merge(world,conts)
> aggall(world1,'cont')
  cont Freq       gdp    income literacy    military
1   AF   47  2723.404  3901.191 60.52979   356440000
2   AS   41  7778.049  8868.098 84.25122  5006536341
3   EU   34 19711.765 21314.324 98.40294  6311138235
4   NA   15  8946.667        NA 85.52000 25919931267
5   OC    4 14625.000 15547.500 87.50000  4462475000
6   SA   12  6283.333  6673.083 92.29167  2137341667

We're close, but notice that there's an NA for income for one of the continents: mean returns NA whenever any of the values it's given is missing. Since the movies data frame didn't have any missing values, we overlooked that detail. The most important thing about writing functions is to remember that you can always modify them using either the edit or fix function. For example, if I type

> aggall = edit(aggall)

I'll see the body of the aggall function in a text editor where I can modify it. In this case, I simply need to add the na.rm=TRUE argument to my call to aggregate, resulting in this new version of the function:

function(df,grp){
   numcols = sapply(df,class) %in% c('integer','numeric')
   result = aggregate(df[,numcols],df[grp],mean,na.rm=TRUE)
   counts = as.data.frame(table(df[grp]))
   names(counts)[1] = grp
   merge(counts,result,sort=FALSE)
}

Testing it again on the world1 data frame shows that we have solved the problem:

> aggall(world1,'cont')
  cont Freq       gdp    income literacy    military
1   AF   47  2723.404  3901.191 60.52979   356440000
2   AS   41  7778.049  8868.098 84.25122  5006536341
3   EU   34 19711.765 21314.324 98.40294  6311138235
4   NA   15  8946.667 10379.143 85.52000 25919931267
5   OC    4 14625.000 15547.500 87.50000  4462475000
6   SA   12  6283.333  6673.083 92.29167  2137341667
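The same check can be made without downloading anything. This is a minimal sketch using a made-up data frame (toy, with a missing value in x); it assumes the revised aggall shown above.

```r
aggall = function(df,grp){
   numcols = sapply(df,class) %in% c('integer','numeric')
   result = aggregate(df[,numcols],df[grp],mean,na.rm=TRUE)
   counts = as.data.frame(table(df[grp]))
   names(counts)[1] = grp
   merge(counts,result,sort=FALSE)
}

# Hypothetical test data: two groups, one missing value in x
toy = data.frame(g = c('a','a','b','b'),
                 x = c(1, 3, NA, 5),
                 y = c(10L, 20L, 30L, 40L),
                 stringsAsFactors = FALSE)

out = aggall(toy,'g')
out             # Freq is 2 for each group; mean of x is 2 for a and 5 for b
anyNA(out)      # FALSE: the missing value no longer propagates
```

Note that Freq still counts every row in the group, including the row whose x is missing, while the mean of x is based only on the non-missing values.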

File translated from TeX by TtH, version 3.67, on 4 Feb 2011, 17:45.