Doing calculations on subgroups of data


Suppose you want to calculate the mean within each group, where the group is defined by a column (field) in your dataframe. First use the split() function to split the data by group, with the output being a list. Then use lapply() to perform the calculation on each element of the list. You can also create discrete groups from a continuous variable using the cut() function.
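As a minimal sketch of the split-then-lapply pattern when the group is already a column, assume a small made-up dataframe df with columns group and value (both names and all numbers are hypothetical, chosen only for illustration):

```r
# Hypothetical toy dataframe; the column names 'group' and 'value' are assumptions
df <- data.frame(group = c("a", "b", "a", "b", "c"),
                 value = c(1, 2, 3, 4, 10))

# Split the values into a list with one element per group
byGroup <- split(df$value, df$group)

# Apply mean() to each list element, then flatten to a named vector
groupMeans <- unlist(lapply(byGroup, mean))
print(groupMeans)
```

The result is a numeric vector whose names are the group labels, so a single group's mean can be pulled out by name, e.g. groupMeans["a"].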

Here's an example that uses cut(). Suppose I have a vector housePrice and a vector income, where the observations are the house price and income for a number of households. To calculate the mean house price for households with similar incomes, I can do the following:
categories=cut(income,breaks=c(seq(0,100000,by=10000),500000))
groupedPrices=split(housePrice,categories)
meanPrice=unlist(lapply(groupedPrices,mean))
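To see what the code above returns, here is a self-contained sketch on made-up numbers (the values of income and housePrice are invented for illustration). It also shows sapply() and tapply() as shorter alternatives. One thing to watch for: income bins that contain no households give NaN with split()/lapply(), and NA with tapply().

```r
# Invented data: five households (not from the original text)
income     <- c(5000, 15000, 18000, 25000, 120000)
housePrice <- c(100, 200, 300, 400, 900)

# Eleven income bins: ten of width 10000, plus one catch-all bin up to 500000
categories <- cut(income, breaks = c(seq(0, 100000, by = 10000), 500000))

# Same steps as in the text
groupedPrices <- split(housePrice, categories)
meanPrice <- unlist(lapply(groupedPrices, mean))

# sapply() combines the lapply() and unlist() steps
viaSapply <- sapply(groupedPrices, mean)

# tapply() does the split and the calculation in one call
viaTapply <- tapply(housePrice, categories, mean)

print(meanPrice)
```

If the empty bins are a nuisance, they can be dropped afterwards, for example with meanPrice[!is.nan(meanPrice)].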

You can do something similar using the aggregate() function, which returns a dataframe with one row per non-empty group rather than a named vector:
categories=cut(income,breaks=c(seq(0,100000,by=10000),500000))
meanPrice=aggregate(housePrice,by=list(categories),FUN='mean')
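The following sketch, again on invented numbers, shows roughly what aggregate() gives back. With an unnamed by list the output columns are called Group.1 and x; the formula interface is one way to get more descriptive column names. The dataframe d is a hypothetical helper introduced only for this example.

```r
# Invented data: five households (not from the original text)
income     <- c(5000, 15000, 18000, 25000, 120000)
housePrice <- c(100, 200, 300, 400, 900)
categories <- cut(income, breaks = c(seq(0, 100000, by = 10000), 500000))

# Same call as in the text: result has columns 'Group.1' and 'x'
meanPrice <- aggregate(housePrice, by = list(categories), FUN = mean)

# Formula interface on a dataframe: result columns keep the variable names
d <- data.frame(housePrice = housePrice, categories = categories)
meanPrice2 <- aggregate(housePrice ~ categories, data = d, FUN = mean)

print(meanPrice)
print(meanPrice2)
```

Unlike the split()/lapply() approach, income bins with no households do not appear in the result here.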

I haven't used it, but the reshape package looks to be useful for manipulating the dimensions of datasets.

Last modified: 12/13/08.

Chris Paciorek 2012-01-21