DSA {DSA}R Documentation

Data-Adaptive Estimation with Cross-Validation

Description

Performs data-adaptive estimation through the selection of the estimator which minimizes the cross-validated risk defined by the L2 loss function. The candidate estimators are defined with semi-parametric models. The candidate models are polynomial generalized linear models generated with the Deletion/Substitution/Addition algorithm under user-specified constraints.

Usage

DSA(Xlearn, Ylearn, binind, maxsize, maxorderint, maxsumofpow, nfolds=5,
    IDlearn=NULL, forcedterms=NULL, CENSOR=0, weights=NULL,
    vsplits=NULL, userseed=NULL)

Arguments

Xlearn a matrix of real numbers containing the candidate covariates in columns. Missing values are allowed. Each column of 'Xlearn' must be uniquely named.
Ylearn a matrix containing the outcome in one or two named column(s). Ylearn is a two-column matrix of integers if the outcome is binomial and a one-column matrix if the outcome is binary or continuous. When the outcome is binomial, the first column of Ylearn contains the number of successes and the second column contains the number of failures. The column names must be different from the column names in 'Xlearn'. Missing values are allowed. The number of rows in 'Ylearn', i.e. the number of observations, must correspond to the number of rows in 'Xlearn'. In any given row, both matrices must contain the data for the same observation.
binind a binary indicator that the outcome is binomial, i.e. binind=1 if 'Ylearn' is a binomial outcome and binind=0 otherwise.
maxsize an integer strictly positive which limits the data-adaptive estimation to candidate linear models with a maximum number of terms (excluding the intercept) lower or equal to 'maxsize'. The argument 'maxsize' must be larger or equal to the number of terms forced into the model through 'forcedterms'.
maxorderint an integer strictly positive which limits the data-adaptive estimation to candidate linear models with a maximum order of interactions lower of equal to 'maxorderint'. The argument 'maxordeint' must be 1 if the number of terms forced in the model is equal to 'maxsize'.
maxsumofpow an integer larger or equal to 'maxorderint' which limits the data-adaptive estimation to candidate linear models with terms involving candidate variables whose powers sum up to a value lower or equal to 'maxsumofpow'. The argument 'maxsumofpow' must be 1 if the number of terms forced in the model is equal to 'maxsize'.
nfolds an integer strictly larger than 1 indicating the number of data splits for the cross-validation procedure. The default value for 'nfolds' is 5.
IDlearn a vector identifying each independent experimental unit in the data with a unique value, i.e. repeated measurements are allowed in 'Xlearn' and 'Ylearn'. The length of 'IDlearn' must correspond to the number of observations. The default value for 'IDlearn' is NULL which indicates that all observations are independent.
forcedterms a formula specifying the terms that should be forced in all candidate models or NULL (default) if no term is to be forced in the models considered. Interactions must be specified with '*' embedded within 'I' and polynomial powers should be specified with '^'.
CENSOR a binary indicator that weights should be used to fit each candidate model, i.e. CENSOR=1 if weights should be used and CENSOR=0 otherwise. The default value for 'CENSOR' is 0.
weights a matrix of real numbers whose number of rows is nfolds+1 and whose number of columns is the number of observations. The first 'nfolds' rows contain the weights to be applied to each observation (whether in the training or validation set) at each split of the data. The last row contains the weights to be applied to each observation in the learning set. The argument 'weights' is ignored if CENSOR=0 and the default value for 'weights' is NULL.
vsplits a matrix of 0's or 1's whose number of rows is 'nfolds' and whose number of column is the number of observations. Each row indicates how the learning set is split in a training (0's) and validation (1's) set. The default value for 'vsplits' is NULL which indicates that the data should be randomely split based on the v-fold splitting scheme, the values for 'userseed', 'nfolds' and 'IDlearn'.
userseed a single value interpreted as an integer and used to set the seed of the random number generator which determines the v-fold data splits. If userseed=NULL (default) then the default seed from the current R session is used. The argument 'userseed' is ignored if 'vsplits' is not NULL.

Details

The DSA routine implements a general data-adaptive estimation procedure based on cross-validation and the L2 loss function. The final estimator is selected from a set of candidate estimators generated by the Deletion/Substitution/Addition (DSA) algorithm. All estimators considered are defined based on semi-parametric models of a conditional mean expection of the outcome. These models are generalized linear polynomial models. The final model minimizes the cross-validated risk defined by the L2 loss function.

The DSA algotithm is an aggressive model search algorithm which iteratively generates generalized linear polynomial models based on the existing terms in a model and the following three steps: 1) a deletion step which removes a term from the model, 2) a substitution step which replaces one term with another, and 3) an addition step which adds a term to the model. The search for the best estimator starts with the simplest model: the intercept model except when the user imposes a number of terms in the models to be considered. These forced terms are identified using the 'forcedterms' argument.

The search for the best estimator is limited by three user-specified arguments: 'maxsize', 'maxorderint' and 'maxsumofpow'. The first argument limits the maximum number of terms in the models considered. The second argument limits the maximum order of interactions for the models considered. Each term is composed of an interactions of candidate variables raised to a given power. The third argument limits the maximum sum of powers in each term.

This data-adaptive estimation procedure allows comparison of models based on different number of observations and can account for informative censoring through the use of weights in each regression. The argument 'CENSOR' must then be set to 1 and an adequate matrix of 'weights' must be provided.

In addition, this routine currently supports data-adaptive estimation of conditional mean outcomes where the outcome is either continuous or binomial. When the outcome is binomial, the estimators considered are based on generalized polynomial models where the link function is the logit function.

The default cross-validation splitting scheme for the data is v-fold where the value for v is specified by the argument 'nfolds'. The DSA routine performs the data splits based on the value for 'nfolds', 'IDlearn' and the 'userseed' arguments. 'IDlearn' identifies the independent experimental units in the data and ensures that the training and validations sets are independent. The argument 'userseed' is used to set the seed of the R random number generators with the routine 'set.seed()'. This allows for reproducible results. The user can use alternative data splits by providing the argument 'vsplits'. This argument will typically need to be provided when CENSOR=1.

Value

'DSA' returns a list with the following objects:

average.CVrisks a matrix containing the average cross-validated risks defined by the L2 loss function over the splits of the data for each selected best models of size 0 to 'maxsize' (intercept excluded) and maximum order of interactions 1 to 'maxorderint'. The average cross-validated risk in row i and column j corresponds to the average cross-validated risk for the best model of size j (intercept excluded) and maximum order of interactions i. Note that the average cross-validated risk for agiven size and order of interaction is the average of the average risks computed on each validation sets, where the average risks on a given validation set corresponds to the residuals sum of squares divided by the number of observation in the validation set.
min.risk a vector with two values respectively identifying the row and column numbers for the minimum average cross-validated risk in 'DSA$average.CVrisks'
model.selected a formula representing the final model selected by the DSA call. The names of the variables corresponds to the column names in Ylearn and Xlearn.
family a description of the link function for the candidate models: 'gaussian' indicates the indentity function while 'binomial' indicates the logit function.
coefficients The coefficients of the final model fitted on the learning set.
seed NULL if the argument 'vplits' was provided by the user or if the user did not provide 'userseed', otherwise equal to 'userseed'.
vsplits the 'vsplits' matrix if provided by the user or the corresponding matrix randomely generated by the DSA call.
computing.time an object of class 'difftime' which contains the information about the computing time necessary for the DSA call to be completed.

Note

The computing time for a DSA call can be very large when the number of candidate variables, 'maxorderint', 'maxsize' and 'maxsumofpow' is large. DSA calls are more computing intensive with binomial outcomes. It is often recommended to run this routine in BATCH mode.

The DSA routine can be used to compute the cross-validated risk for a user-specified model through the argument 'forcedterms' and by setting 'maxsize' to the number of terms in the model.

Author(s)

Romain Neugebauer and James Bullard based on the original C code from Sandra Sinisi

References

1. Mark J. van der Laan and Sandrine Dudoit, "Unified Cross-Validation Methodology For Selection Among Estimators and a General Cross-Validated Adaptive Epsilon-Net Estimator: Finite Sample Oracle Inequalities and Examples" (November 2003). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 130. http://www.bepress.com/ucbbiostat/paper130

2. Mark J. van der Laan, Sandrine Dudoit, and Aad W. van der Vaart, "The Cross-Validated Adaptive Epsilon-Net Estimator" (February 2004). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 142. http://www.bepress.com/ucbbiostat/paper142

3. Sandra E. Sinisi and Mark J. van der Laan, "Loss-Based Cross-Validated Deletion/Substitution/Addition Algorithms in Estimation" (March 2004). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 143. http://www.bepress.com/ucbbiostat/paper143

See Also

I, set.seed, difftime.

Examples

  require(DSA)
  set.DSA.message.level(-1) # no printing. 

  ## an example using the state census data. (ols)  
  data(state)
  W <- state.x77[,c(1,2,3,4,6,7,8)]
  Y <- as.matrix(W[,1]*state.x77[,5]/100)
  colnames(W) <- c("pop", "inc", "illiteracy", "lifeexp", "hsgrad",
                   "frost", "area")
  colnames(Y) <- c("murders")
  res <- DSA(W, Y, binind = 0, maxsize = 5, maxsumofpow = 2, maxorderint = 2)
  res$model.selected

  ## an example using weights (ols).
  ddir <- paste(system.file(package = "DSA"), "/", "data", "/", sep = "")
  x <- as.matrix(read.table(paste(ddir, "smalllymphRoX.txt", sep = ""),nrow=240))
  dimnames(x) <- list(NULL,dimnames(x)[[2]])
  y <- as.matrix(read.table(paste(ddir, "logT1lymph.txt", sep = "")))
  dimnames(y) <- list(NULL,"Y")
  SplitBn <- as.matrix(read.table(paste(ddir, "bnVectors_lymph_v5", sep = "")))
  WeightsV <- as.matrix(read.table(paste(ddir, "lymphWeights_111505", sep = "")))
  
  res <- DSA(Xlearn=x, Ylearn=y, binind=0, maxsize=5, maxorderint=2, maxsumofpow=2,
             nfolds=5, CENSOR=1, weights=WeightsV, vsplits=SplitBn)
  res$model.selected

  ## an example using simulated data (logistic).
  n <- 1000
  W <- cbind(rnorm(n), rnorm(n) < 1, rnorm(n) < 2, rnorm(n, 2, 4), runif(n),
             rcauchy(n), rlogis(n) < .1, rnorm(n) < .1, rnorm(n, 120, 10),
             rnorm(n, 66, 2))

  Y <- 22 + .5*W[,1] + .02*W[,1]^2 + .01*W[,1]*W[,2] + 2*W[,3] + .7*W[,4]^2 
  Y <- as.matrix(as.integer(Y - mean(Y))/sd(Y))

  Y <- as.matrix(as.integer(Y < rlogis(1))) # apply(Y, 1, function(x) { rbinom(1,1,exp(x)/(1+exp(x))) }))

  colnames(W) <- paste("V", 1:ncol(W), sep = "")
  colnames(Y) <- "Y"
  
  res <- DSA(W, Y, binind = 1, maxsize = 8, maxorderint = 2, maxsumofpow = 3)  


[Package DSA version 1.1 Index]