Consider the problem of removing from a data frame every row in which any value is NA, i.e. every row with incomplete data.
Here are four different functions for doing this.
The first function creates the output data frame one row at a time: it appends a new row to the object res each time it finds a row that has no NAs.
    funAgg = function(x) {
      # initialize res
      res = NULL
      n = nrow(x)
      for (i in 1:n) {
        if (!any(is.na(x[i,])))
          res = rbind(res, x[i,])
      }
      res
    }
An alternative is to initialise the res object to the correct size (or roughly the correct size), then overwrite its rows one at a time as each input row with no NAs is found. Notice that we have to keep track of where we are in the output.
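    funLoop = function(x) {
      # Initialize res with x
      res = x
      n = nrow(x)
      k = 1   # index of the next free row in res
      for (i in 1:n) {
        if (!any(is.na(x[i,]))) {
          res[k, ] = x[i,]
          k = k + 1
        }
      }
      # seq_len() also handles the case where no complete row was found
      res[seq_len(k - 1), ]
    }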
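The cost difference between growing an object and filling a preallocated one can be seen in isolation with a small sketch. The helper names growRows and fillRows are hypothetical and not part of the original functions; timings vary by machine, so none are shown.

```r
# Hypothetical helpers for illustration only.
# growRows: append one row at a time with rbind (the result is copied each time).
growRows = function(m) {
  res = NULL
  for (i in seq_len(nrow(m))) res = rbind(res, m[i, ])
  res
}

# fillRows: allocate the full result once, then overwrite rows in place.
fillRows = function(m) {
  res = matrix(NA_real_, nrow(m), ncol(m))
  for (i in seq_len(nrow(m))) res[i, ] = m[i, ]
  res
}

m = matrix(rnorm(2000 * 20), 2000, 20)
tGrow = system.time(a <- growRows(m))
tFill = system.time(b <- fillRows(m))

# Both approaches build the same matrix; only the cost differs.
identical(unname(a), unname(b))
print(c(grow = tGrow[["elapsed"]], fill = tFill[["elapsed"]]))
```

The gap between the two timings is expected to widen as the number of rows increases, because each rbind call copies everything accumulated so far.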
Our next attempt uses a property of the is.na function (namely, that it returns a matrix of logicals when given a data frame) and employs the apply function. It is very simple.
    funApply = function(x) {
      drop = apply(is.na(x), 1, any)
      x[!drop, ]
    }
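To make the two steps concrete, here is a minimal sketch on a tiny made-up data frame (the values are hypothetical). It shows that is.na returns a logical matrix and that apply with MARGIN = 1 reduces it to one logical per row.

```r
# Tiny made-up data frame with NAs in rows 2 and 3.
d = data.frame(a = c(1, NA, 3), b = c("u", "v", NA))

# is.na() on a data frame returns a logical matrix of the same shape.
miss = is.na(d)
is.matrix(miss)
dim(miss)

# MARGIN = 1 applies `any` across each row, giving one logical per row.
drop = apply(miss, 1, any)

# Keep only the rows with no missing values.
d[!drop, ]
```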
This function follows what R does when you call na.omit.
    funOmit = function(x) {
      # The or operation is very fast; it replaces the any function.
      # Also note that it doesn't require having another data frame as big as x.
      drop = FALSE
      n = ncol(x)
      for (i in 1:n)
        drop = drop | is.na(x[, i])
      x[!drop, ]
    }
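As a sanity check, the column-wise function can be compared with the built-in na.omit and complete.cases on a small made-up data frame. This is only a sketch: the data values are hypothetical, and funOmit is repeated here so that the block runs on its own.

```r
# funOmit as defined in the text, repeated so the sketch is self-contained.
funOmit = function(x) {
  drop = FALSE
  for (i in seq_len(ncol(x)))
    drop = drop | is.na(x[, i])
  x[!drop, ]
}

# Small made-up data frame; only rows 4 and 5 are complete.
d = data.frame(a = c(1, NA, 3, 4, 5),
               b = c(NA, 2, 3, 4, 5),
               c = c("p", "q", NA, "s", "t"))

viaLoop  = funOmit(d)
viaBase  = na.omit(d)                # built-in equivalent
viaCases = d[complete.cases(d), ]    # another built-in route

# na.omit() records the dropped rows in an "na.action" attribute,
# so attributes are ignored in that comparison.
all.equal(viaLoop, viaBase, check.attributes = FALSE)
identical(viaLoop, viaCases)
```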
    # Make up large test case
    xx = matrix(rnorm(2000000), 100000, 20)
    xx[xx > 2] = NA
    x = as.data.frame(xx)

    # Call the R code profiler and give it an output file to hold results
    Rprof("exampleAgg.out")
    # Call the function to be profiled
    y = funAgg(xx)
    Rprof(NULL)

    Rprof("exampleLoop.out")
    y = funLoop(xx)
    Rprof(NULL)

    Rprof("exampleApply.out")
    y = funApply(xx)
    Rprof(NULL)

    Rprof("exampleOmit.out")
    y = funOmit(xx)
    Rprof(NULL)
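One way to summarise a profile file from within R is the summaryRprof function in the utils package. The sketch below is an assumption about workflow, not necessarily how the tables in this document were produced. The function busy is a hypothetical workload, and a temporary file stands in for the fixed file names used above.

```r
# Hypothetical workload: repeat a column-wise NA scan often enough
# for the sampling profiler to catch it.
busy = function(x, reps) {
  for (r in seq_len(reps)) {
    drop = FALSE
    for (i in seq_len(ncol(x))) drop = drop | is.na(x[, i])
  }
  x[!drop, ]
}

xx = matrix(rnorm(200000), 10000, 20)
xx[xx > 2] = NA

out = tempfile(fileext = ".out")   # temporary file instead of a fixed name
Rprof(out)
y = busy(xx, 500)
Rprof(NULL)

# summaryRprof() reads the samples and tabulates time per function.
# It raises an error if the run was too short to collect any samples.
prof = summaryRprof(out)
names(prof)          # by.self, by.total, sample.interval, sampling.time
head(prof$by.self)
```

The by.self and by.total components correspond to orderings by self time and by total time, respectively.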
The output from the profiling follows.
The function funAgg - interrupted before completion
Each sample represents 0.02 seconds.
Total run time: 73.9000000000024 seconds.
Total seconds: time spent in function and callees.
Self seconds: time spent in function alone.
    % total   total seconds   % self   self seconds   name
     100.00           73.90     0.32           0.24   "funAgg"
      99.16           73.28    99.16          73.28   "rbind"
       0.46            0.34     0.38           0.28   "any"
       0.08            0.06     0.08           0.06   "is.na"
       0.05            0.04     0.05           0.04   "!"

     % self   self seconds   % total   total seconds   name
      99.16          73.28     99.16           73.28   "rbind"
       0.38           0.28      0.46            0.34   "any"
       0.32           0.24    100.00           73.90   "funAgg"
       0.08           0.06      0.08            0.06   "is.na"
       0.05           0.04      0.05            0.04   "!"
The function funLoop
Each sample represents 0.02 seconds.
Total run time: 3.56 seconds.
Total seconds: time spent in function and callees.
Self seconds: time spent in function alone.
    % total   total seconds   % self   self seconds   name
     100.00            3.56    50.56           1.80   "funLoop"
      46.07            1.64    39.89           1.42   "any"
       6.18            0.22     6.18           0.22   "is.na"
       2.25            0.08     2.25           0.08   "!"
       1.12            0.04     1.12           0.04   "+"

     % self   self seconds   % total   total seconds   name
      50.56           1.80    100.00            3.56   "funLoop"
      39.89           1.42     46.07            1.64   "any"
       6.18           0.22      6.18            0.22   "is.na"
       2.25           0.08      2.25            0.08   "!"
       1.12           0.04      1.12            0.04   "+"
The function funApply
Each sample represents 0.02 seconds.
Total run time: 3.94 seconds.
Total seconds: time spent in function and callees.
Self seconds: time spent in function alone.
    % total   total seconds   % self   self seconds   name
     100.00            3.94     2.54           0.10   "funApply"
      97.46            3.84    46.70           1.84   "apply"
      23.86            0.94    23.86           0.94   "FUN"
      18.27            0.72     1.52           0.06   "unlist"
      17.26            0.68     0.00           0.00   "any"
      16.75            0.66    15.74           0.62   "lapply"
       6.60            0.26     6.60           0.26   "aperm"
       2.03            0.08     2.03           0.08   "is.na"
       1.02            0.04     1.02           0.04   "names<-.default"
       1.02            0.04     0.00           0.00   "names<-"

     % self   self seconds   % total   total seconds   name
      46.70           1.84     97.46            3.84   "apply"
      23.86           0.94     23.86            0.94   "FUN"
      15.74           0.62     16.75            0.66   "lapply"
       6.60           0.26      6.60            0.26   "aperm"
       2.54           0.10    100.00            3.94   "funApply"
       2.03           0.08      2.03            0.08   "is.na"
       1.52           0.06     18.27            0.72   "unlist"
       1.02           0.04      1.02            0.04   "names<-.default"
The function funOmit
Each sample represents 0.02 seconds.
Total run time: 0.62 seconds.
Total seconds: time spent in function and callees.
Self seconds: time spent in function alone.
    % total   total seconds   % self   self seconds   name
     100.00            0.62    38.71           0.24   "funOmit"
      48.39            0.30    48.39           0.30   "|"
      12.90            0.08    12.90           0.08   "is.na"

     % self   self seconds   % total   total seconds   name
      48.39           0.30     48.39            0.30   "|"
      38.71           0.24    100.00            0.62   "funOmit"
      12.90           0.08     12.90            0.08   "is.na"
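The four total run times reported in the profiles can be put side by side. The sketch below only does arithmetic on the figures printed above. The funAgg figure is a lower bound, since that run was interrupted before completion.

```r
# Total run times in seconds, copied from the profiler output in the text.
# funAgg was interrupted, so 73.90 is only a lower bound.
total = c(funAgg = 73.90, funLoop = 3.56, funApply = 3.94, funOmit = 0.62)

# Express each run time as a multiple of the fastest one.
ratio = total / min(total)
round(ratio, 1)
```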