H2O

Currently, I spend most of my time contributing to H2O, an open-source machine learning project.

R packages:

rsparkling. The rsparkling R package provides bindings to H2O’s distributed machine learning algorithms via RStudio's sparklyr. In particular, rsparkling allows you to access the machine learning routines provided by the Sparkling Water Spark package. Together with sparklyr’s dplyr interface, you can easily create and tune H2O machine learning workflows on Spark, orchestrated entirely within R.

Latest Release:  CRAN  |  GitHub


h2oEnsemble. H2O Ensemble is an implementation of the Super Learner ensemble algorithm, built on H2O, the open-source, Java-based machine learning platform for big data. It is currently implemented as a stand-alone R package that uses the h2o package, the R interface to the H2O platform. Any of the supervised machine learning algorithms supported by H2O can be used as a base learner for the ensemble: generalized linear models (GLMs) with Elastic Net regularization, Gradient Boosting (GBM) with regression and classification trees, Random Forest, and Deep Learning (multi-layer feed-forward neural networks).

Latest Release:  GitHub


subsemble. Subsemble is a general subset ensemble prediction method, which can be used for small, moderate, or large datasets. Subsemble partitions the full dataset into subsets of observations, fits a specified underlying algorithm on each subset, and uses a unique form of V-fold cross-validation to output a prediction function that combines the subset-specific fits. An oracle result provides a theoretical performance guarantee for Subsemble.

Latest Release:  CRAN  |  GitHub
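
The Subsemble procedure described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the subsemble package's implementation: the names `fit_line` and `subsemble` are hypothetical, the underlying algorithm is a one-dimensional least-squares line, and a simple inverse-error weighting stands in for the learned combination step that the real method uses.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    a = my - b * mx
    return lambda x: a + b * x

def subsemble(xs, ys, n_subsets=3, n_folds=2, seed=0):
    """Illustrative subset ensemble; returns a combined prediction function."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rng.shuffle(idx)

    # Partition the observations into disjoint subsets.
    subsets = [idx[j::n_subsets] for j in range(n_subsets)]
    # Split each subset into V folds. Fold v of the full data is the
    # union of fold v across all subsets.
    folds = [[s[v::n_folds] for v in range(n_folds)] for s in subsets]

    # Cross-validated squared error of each subset-specific fit.
    sq_err = [0.0] * n_subsets
    for v in range(n_folds):
        held_out = [i for j in range(n_subsets) for i in folds[j][v]]
        for j in range(n_subsets):
            train = [i for u in range(n_folds) if u != v for i in folds[j][u]]
            f = fit_line([xs[i] for i in train], [ys[i] for i in train])
            sq_err[j] += sum((ys[i] - f(xs[i])) ** 2 for i in held_out)

    # Assumption: inverse-error weights replace the learned combination.
    inv = [1.0 / (e + 1e-12) for e in sq_err]
    total = sum(inv)
    weights = [w / total for w in inv]

    # Refit the underlying algorithm on each full subset and combine.
    fits = [fit_line([xs[i] for i in s], [ys[i] for i in s]) for s in subsets]
    return lambda x: sum(w * f(x) for w, f in zip(weights, fits))
```

Calling `subsemble(xs, ys)` returns a function that predicts for a new `x` by weighting the subset-specific fits. Each subset needs enough observations for every training fold to contain at least two distinct `x` values.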


cvAUC. This package contains various tools for working with and evaluating cross-validated area under the ROC curve (AUC) estimators. The primary functions of the package compute confidence intervals for cross-validated AUC estimates based on influence curves, for both i.i.d. and pooled repeated measures data. One benefit of influence-curve-based confidence intervals is that they require much less computation time than bootstrapping methods.

Latest Release:  CRAN  |  GitHub
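
To make the quantity that cvAUC works with concrete, here is a minimal Python sketch of a cross-validated AUC point estimate, computed as the average of the per-fold empirical AUCs. The function names `auc` and `cv_auc` are hypothetical and this is not the package's code. It also leaves out the influence-curve-based confidence intervals, which are the package's main contribution.

```python
def auc(scores, labels):
    """Empirical AUC: share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cv_auc(scores, labels, folds):
    """Average of the per-fold AUCs.

    `scores` are held-out predictions; `folds` is a list of index lists,
    one per validation fold.
    """
    per_fold = [
        auc([scores[i] for i in idx], [labels[i] for i in idx])
        for idx in folds
    ]
    return sum(per_fold) / len(per_fold)
```

Each fold must contain at least one positive and one negative label; otherwise `auc` raises a `ValueError`.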


casecontrolSL. This package is an extension to the SuperLearner R package that reduces computation time for binary classification by combining subsampling with an inverse probability of censoring weighting (IPCW) scheme. This technique is particularly useful when the binary outcome is rare.

Latest Release:  casecontrolSL_0.1-5.tar.gz  |  Manual
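
One way to picture the subsampling-plus-weighting idea behind casecontrolSL is the Python sketch below. It rests on assumptions that are not taken from the package: every case (label 1) is kept, each control (label 0) is kept with a fixed probability, and each retained observation is weighted by the inverse of its probability of being retained. The function names and the `control_keep_prob` parameter are hypothetical.

```python
import random

def subsample_with_weights(labels, control_keep_prob, seed=0):
    """Keep all cases; keep each control with probability control_keep_prob.

    Returns (kept_indices, weights), where each weight is
    1 / P(observation is retained).
    """
    if not 0.0 < control_keep_prob <= 1.0:
        raise ValueError("control_keep_prob must be in (0, 1]")
    rng = random.Random(seed)
    kept, weights = [], []
    for i, y in enumerate(labels):
        if y == 1:
            # Cases are always retained, so their weight is 1.
            kept.append(i)
            weights.append(1.0)
        elif rng.random() < control_keep_prob:
            # Retained controls stand in for the discarded ones.
            kept.append(i)
            weights.append(1.0 / control_keep_prob)
    return kept, weights

def weighted_prevalence(labels, kept, weights):
    """Weighted share of cases in the subsample."""
    case_weight = sum(w for i, w in zip(kept, weights) if labels[i] == 1)
    return case_weight / sum(weights)
```

A learner trained on the much smaller subsample with these weights sees roughly the same outcome prevalence as the full data, which is where the savings in computation time come from when the outcome is rare.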

rHealthDataGov. An R interface for the HealthData.gov data API that allows for the easy filtering and retrieval of 33 health data resources. This package is part of the rOpenHealth project.

Latest Release:  CRAN  |  GitHub