Functions
1 Functions
When you create a function, it defines a separate environment and the variables you
create inside your function only exist in that function environment; when you return
to where you called the function from, those variables no longer exist.
You can refer to other objects that are in the calling environment, but if you make any
changes to them, the changes will only take place in the function environment.
To get information
back to the calling environment, you must pass a return value, which will be available
through the functions name. R will automatically return the last unassigned value it
encounters in your function, or
you can place the object you want to return in a
call to the return function. You can only return a single object from a function
in R; if you need to return multiple objects, you need to return a list containing those
objects, and extract them from the list when you return to the calling environment.
As a simple example of a function that returns a value, suppose we want to calculate the
ratio of the maximum value of a vector to the minimum value of the vector. Here's a
function definition that will do the job:
maxminratio = function(x)max(x)/min(x)
Notice for a single line function you don't need to use brackets ({}) around the
function body, but you are free to do so if you like. Since the final statement wasn't
assigned to a variable, it will be used as a return value when the function is called.
Alternatively, the value could be placed in a call to the return function.
If we wanted to find the max to min ratio for all the columns of the matrix,
we could use our function with the apply function:
apply(mymat,2,maxminratio)
The 2 in the call to apply tells it to operate on the
columns of the matrix; a 1 would be used to work on the rows.
Before we leave this example, it should be pointed out that this function has
a weakness - what if we pass it a vector that has missing values? Since we're
calling min and max without the na.rm=TRUE argument,
we'll always get a missing value if our input data has any missing values.
One way to solve the problem is to just put the na.rm=TRUE argument
into the calls to min and max. A better way would be to
create a new argument with a default value. That way, we still only have to
pass one argument to our function, but we can modify the
na.rm= argument if we need to.
maxminratio = function(x,na.rm=TRUE)max(x,na.rm=na.rm)/min(x,na.rm=na.rm)
If you look at the function definitions for functions in
R, you'll see that many of them use this method of setting defaults in the
argument list.
As your functions get longer and more complex, it becomes more difficult to simply type
them into an interactive R session. To make it easy to edit functions, R provides the
edit command, which will open an editor appropriate to your operating system.
When you close the editor,
the edit function will return the edited copy of your function, so it's important
to remember to assign the return value from edit to the function's name.
If you've already defined a function, you can edit it by simply passing it to
edit, as in
minmaxratio = edit(minmaxratio)
You may also want to consider the fix function, which automates
the process slightly.
To start from scratch, you can use a call to edit like this:
newfunction = edit(function(){})
Suppose we want to write a function that will allow us to calculate
the mean of all the appropriate columns of a data frame,
broken down by a grouping variable, and
and including the counts for the grouping variables in the output.
When you're working on developing a function, it's usually
easier to solve the problem with a sample data set, and then generalize
it to a function.
We'll use the movies data frame as an example, with both weekday
and month as potential grouping variables. First, let's go over the
steps to create the movies data frame with both grouping variables:
> movies = read.delim('http://www.stat.berkeley.edu/classes/s133/data/movies.txt',
+ sep='|',stringsAsFactors=FALSE)
> movies$box = as.numeric(sub('\\$','',movies$box))
> movies$date = as.Date(movies$date,'%B %d, %Y')
> movies$weekday = weekdays(movies$date)
> movies$weekday = factor(weekdays(movies$date),
+ levels = c('Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'))
> movies$month = months(movies$date)
> movies$month = factor(months(movies$date),levels=c('January','February','March',
+ 'April','May','June','July','August','September','October','November','December')
Since I've done a fair amount of processing to this data set, and since I'm going
to want to use it later for testing my function, I'm going to use the save function
to write a copy of the data frame to a file. This function writes out R objects in R's
internal format, just like the workspace is saved at the end of an R session. You can also
transfer a file produced by save to a different computer, because R uses the same
format for its saved objects on all operating systems. Since save accepts a variable
number of arguments, we need to specify the file= argument when we call it:
> save(movies,file='movies.rda')
You can use whatever extension you want, but .rda or .Rdata are
common choices.
It's often useful to breakdown the steps of a problem like this, and solve each one
before going on to the next. Here are the steps we'll need to go through to create
our function.
- Find the appropriate columns of the data frame for the aggregate
function.
-
Write the call to the aggregate function that will give us the mean
for each group.
-
Write the call to the function to get the counts and convert it to
a data frame.
-
Merge together the results from aggregate and table to give
us our result.
To find the appropriate variables, we can examine the class and mode
of each column of our data frame:
> sapply(movies,class)
rank name box date weekday month
"integer" "character" "numeric" "Date" "factor" "factor"
> sapply(movies,mode)
rank name box date weekday month
"numeric" "character" "numeric" "numeric" "numeric" "numeric"
For this data frame, the appropriate variables for aggregation would
be rank and box, so we have to come up with some logic that
would select only those columns. One easy way is to select those columns whose
class is either numeric or integer. We can use the |
operator which represents a logical "or" to create a logical vector that will
let us select the columns we want. (There's also the & operator which
is used to test for a logical "and".)
> classes = sapply(movies,class)
> numcols = classes == 'integer' | classes == 'numeric'
While this will certainly work, R provides an operator that makes expressions
like this easier to write. The %in% operator allows us to test for equality
to more than one value at a time, without having to do multiple tests. In this example
we can use it as follows:
> numcols = sapply(movies,class) %in% c('integer','numeric')
Now we need to write a call to the aggregate function that will find the
means for each variable based on a grouping variable. To develop the appropriate call,
we'll use weekday as a grouping variable:
> result = aggregate(movies[,numcols],movies['weekday'],mean)
> result
weekday rank box
1 Monday 354.5000 148.04620
2 Tuesday 498.9545 110.42391
3 Wednesday 427.1863 139.38540
4 Thursday 493.7692 117.89700
5 Friday 520.2413 112.44878
6 Saturday 577.5714 91.18714
7 Sunday 338.1818 140.45618
Similarly, we need to create a data frame of counts that can be
merged with the result of aggregate:
> counts = as.data.frame(table(movies['weekday']))
> counts
Var1 Freq
1 Monday 10
2 Tuesday 22
3 Wednesday 161
4 Thursday 39
5 Friday 750
6 Saturday 7
7 Sunday 11
Unfortunately, this doesn't name the first column appropriately for
the merge function. The best way to solve this problem is to change the
name of the first column of the counts data frame to the name of the
grouping variable.
Recall that using the sort=FALSE argument
to merge will retain the order of the grouping variable that we specified with
the levels= argument to factor
> names(counts)[1] = 'weekday'
> merge(counts,result,sort=FALSE)
weekday Freq rank box
1 Monday 10 354.5000 148.04620
2 Tuesday 22 498.9545 110.42391
3 Wednesday 161 427.1863 139.38540
4 Thursday 39 493.7692 117.89700
5 Friday 750 520.2413 112.44878
6 Saturday 7 577.5714 91.18714
7 Sunday 11 338.1818 140.45618
This gives us exactly the result we want, with the columns labeled
appropriately.
To convert this to a function, let's put together all the steps we just
performed:
> load('movies.rda')
> numcols = sapply(movies,class) %in% c('integer','numeric')
> result = aggregate(movies[,numcols],movies['weekday'],mean)
> counts = as.data.frame(table(movies['weekday']))
> names(counts)[1] = 'weekday'
> merge(counts,result,sort=FALSE)
weekday Freq rank box
1 Monday 10 354.5000 148.04620
2 Tuesday 22 498.9545 110.42391
3 Wednesday 161 427.1863 139.38540
4 Thursday 39 493.7692 117.89700
5 Friday 750 520.2413 112.44878
6 Saturday 7 577.5714 91.18714
7 Sunday 11 338.1818 140.45618
To convert these steps into a function that we could
use with any data frame, we need to identify the parts of these
statements that would change with different data. In this case,
there are two variables that we'd have to change: movies
which represents the data frame we're using, and 'weekday'
which represents the grouping variable we're using. Here's a
function that will perform these operations for any combination
of data frame and grouping variable. (I'll change movies
to df and 'weekday' to grp to make the
names more general, and name the function aggall:
> aggall = function(df,grp){
+ numcols = sapply(df,class) %in% c('integer','numeric')
+ result = aggregate(df[,numcols],df[grp],mean)
+ counts = as.data.frame(table(df[grp]))
+ names(counts)[1] = grp
+ merge(counts,result,sort=FALSE)
+ }
I'm taking advantage of the fact the R functions will return
the result of the last statement in the function that's not assigned
to a variable, which in this case is the result of the merge
function. Alternatively, I could assign the result of the merge function
to a variable and use the return function
to pass back the result. At this point it would be a good idea to
copy our function into a text file so that we can re-enter it into our
R session whenever we need it. You could copy the definition from the
console, and remove the > prompts, or you could use the
history command to cut and paste the function without the
prompts. (Other ways of saving the text of the function include the
dput and dump functions, or simply making sure you save
the workspace before quitting R.) I'll call the text file that I
paste the definition into aggall.r.
Now that we have the function written, we need to test it.
It's often a good idea to test your functions in a "fresh" R session.
We can get the movies data frame back into our workspace by
using the load command, passing it the name of the file that
we created earlier with the save command, and we can use
the source function to read back the definition of the
aggall function:
> rm(list=objects()) # removes everything from your workspace !!
> source('aggall.r')
> load('movies.rda')
The first
test would be to use the movies data frame, but with a different
grouping variable:
> aggall(movies,'month')
month Freq rank box
1 January 40 605.0750 92.41378
2 February 55 621.5818 90.98182
3 March 63 516.1111 107.73638
4 April 47 712.9362 77.45891
5 May 95 338.7474 168.77575
6 June 157 449.3885 125.98888
7 July 121 454.4298 129.67063
8 August 74 562.4865 100.53996
9 September 31 553.2581 99.29284
10 October 54 623.6667 86.12557
11 November 111 441.3423 124.22192
12 December 158 507.9051 117.10242
It seems to work well.
But the real test is to use the function on a different data frame,
like the world1 data frame. Let's recreate it:
> world = read.csv('http://www.stat.berkeley.edu/classes/s133/data/world.txt',header=TRUE)
> conts = read.csv('http://www.stat.berkeley.edu/classes/s133/data/conts.txt',na.string='.')
> world1 = merge(world,conts)
> aggall(world1,'cont')
cont Freq gdp income literacy military
1 AF 47 2723.404 3901.191 60.52979 356440000
2 AS 41 7778.049 8868.098 84.25122 5006536341
3 EU 34 19711.765 21314.324 98.40294 6311138235
4 NA 15 8946.667 NA 85.52000 25919931267
5 OC 4 14625.000 15547.500 87.50000 4462475000
6 SA 12 6283.333 6673.083 92.29167 2137341667
We're close, but notice that there's an NA for
income for one of the continents. Since the movies
data frame didn't have any missing values, we overlooked that detail.
The most important thing about writing functions is to remember that
you can always modify them using either the edit or fix
function. For example, if I type
> aggall = edit(aggall)
I'll see the body of the aggall function in a text
editor where I can modify it. In this case, I simply need to add the
na.rm=TRUE argument to my call to aggregate, resulting
in this new version of the function:
function(df,grp){
numcols = sapply(df,class) %in% c('integer','numeric')
result = aggregate(df[,numcols],df[grp],mean,na.rm=TRUE)
counts = as.data.frame(table(df[grp]))
names(counts)[1] = grp
merge(counts,result,sort=FALSE)
}
Testing it again on the world1 data frame shows that
we have solved the problem:
> aggall(world1,'cont')
cont Freq gdp income literacy military
1 AF 47 2723.404 3901.191 60.52979 356440000
2 AS 41 7778.049 8868.098 84.25122 5006536341
3 EU 34 19711.765 21314.324 98.40294 6311138235
4 NA 15 8946.667 10379.143 85.52000 25919931267
5 OC 4 14625.000 15547.500 87.50000 4462475000
6 SA 12 6283.333 6673.083 92.29167 2137341667
File translated from
TEX
by
TTH,
version 3.67.
On 4 Feb 2011, 17:45.