DSA {DSA} | R Documentation |
Performs data-adaptive estimation through estimator selection based on cross-validation and the L2 loss function. Candidate estimators are defined with polynomial generalized linear models generated with the Deletion/Substitution/Addition (D/S/A) algorithm under user-specified constraints.
DSA(formula, data, id = 1:nrow(data), family = gaussian, weights = NULL,candidate.rank = NULL, rank.cutoffs = 1, maxsize, maxorderint, maxsumofpow, usersplits = NULL,userseed = NULL, vfold = 5, nsplits = 1, model.matrices = FALSE, silent = TRUE, ...)
formula |
a symbolic description of the base model which specifies the
independent/response variable(s) and all terms forced in
the final model. Typically, formula is set to
Y ~ 1 when no terms are forced in the final model. Currently supported outcomes are
continuous or binomial (binary with 0s and 1s or a two-column
matrix of successes in the first column and failures in the
second). Dependent/Explanatory candidate variables can be continuous
or categorical (factors). Only polynomial main terms and
polynomial interaction terms specified within the I()
subroutine can be forced in the final model, i.e. interactions
between a factor and other variables are currently not suported.
|
data |
a non-optional data frame containing both the response variable(s) as well as the candidate covariates to be considered in the data-adaptive estimation procedure. |
id |
a vector identifying each independent experimental unit
in the data with a unique value, i.e. repeated measurements are
allowed in data . The length of id must correspond to the number of
observations (nrow(data) ). The default value for id is 1:nrow(data) which
indicates that all observations are independent.
|
family |
currently either binomial or gaussian . Used to determine whether
logistic or general linear models should be considered.
|
weights |
a matrix of real numbers whose number of rows
is the number of data splits plus one (i.e., vfold+1 or dim(usersplits)[[1]]+1 ) and whose number of columns is
the number of observations. The first vfold or dim(usersplits)[[1]] rows contain the weights to be applied to
each observation (whether in the training or validation set) at each
split of the data. The last row contains
the weights to be applied to each observation in the learning
set. The argument weights is ignored if the value of
weights is NULL (default).
|
candidate.rank |
a vector of named real numbers which ranks all
candidate variables in data . Missing values are not allowed. Each element's name must
correspond to the name of one of the candidate variables in
data . The candidate variables' ranks are used to
determine which set(s) of covariates in data will actually be
considered in the DSA call based on the argument rank.cutoffs . When
candidate.rank is NULL (default), the variables in
data are ranked with a call to candidateReduction . Note that
the candidate dummy variables automatically created for all factors in data receive different
ranks by default while their ranks are identical when candidate.rank is user-supplied.
|
rank.cutoffs |
a vector of real numbers which is used with
candidate.rank to define the set(s) of candidate variables in
data that will be considered in the data-adaptive estimation procedure. By default,
rank.cutoffs is set to 1, which means that all candidate
variables in data are considered in the DSA call if
candidate.rank is NULL .
|
maxsize |
an integer strictly positive which limits the data-adaptive
estimation procedure to candidate linear models with a maximum
number of terms (excluding the intercept) lower or equal to
maxsize . The argument maxsize must be larger or equal to the
number of terms forced into the model through formula .
|
maxorderint |
an integer strictly positive which limits the data-adaptive
estimation procedure to candidate linear models with a maximum
order of interactions lower or equal to maxorderint .
|
maxsumofpow |
an integer larger or equal to maxorderint which limits the data-adaptive
estimation procedure to candidate generalized linear models with terms involving
variables whose powers sum up to a value lower or equal to
maxsumofpow .
|
vfold |
an integer strictly larger than 1 specifying the number of splits
(i.e, v) in
the default v-fold splitting scheme for the
cross-validation procedure. The default value for vfold is
5 . The argument vfold is ignored if the argument
usersplits is non-null.
|
nsplits |
an integer larger than 1 indicating the number of v-fold splits for the
cross-validation procedure. The default value for nsplits is
1 . This argument is ignored if the user provides the
data splitting scheme with usersplits .
|
usersplits |
an optional matrix of 0s or 1s whose number of columns is the number of
observations. Each row indicates how the learning set
is split in a training (0s) and validation (1s) set. The default
value for usersplits is NULL which indicates that the data should be
randomely split based on the v-fold splitting scheme corresponding to the values
for userseed , vfold , nsplits , and id .
|
userseed |
a single value interpreted as an integer and used to
set the seed of the random number
generator which determines the v-fold data splits. If userseed=NULL (default) then
the current seed from the current R session is used. The argument
userseed is ignored if usersplits is non-null. |
model.matrices |
a logical variable indicating that only the design and response matrices should be returned. |
silent |
if FALSE then
intermediate messages will be printed to standard output showing the progress of the
computations (message level set to 0). if TRUE then no
intermediate messages will be printed to standard output (message
level set to -1). One can further control the messaging level to be
printed to standard output with a call to the setDSAMessageLevel
subroutine preceding a call to the DSA routine where the argument
silent is not referenced.
|
... |
currently used internally to recursively call the DSA. |
The DSA routine implements a general data-adaptive estimation
procedure based on cross-validation and the L2 loss function. The final estimator is selected from a set
of candidate estimators defined with polynomial generalized linear models generated by the
Deletion/Substitution/Addition (D/S/A) algorithm. The space of
candidate estimators is parameterized with four variables:
maxsize
, maxorderint
, maxsumofpow
and rank.cutoffs
. The final
model returned minimizes the empirical risk on the
learning set among all estimators considered and characterized by the
"optimum" size, order of interactions and set of candidate variables selected by cross-validation.
The D/S/A algorithm is an aggressive model search algorithm which
iteratively generates polynomial generalized linear models based on
the existing terms in the current 'best' model and the following three steps: 1) a deletion step which removes a term from the model,
2) a substitution step which replaces one term with another, and 3) an
addition step which adds a term to the model. The search for the 'best'
estimator starts with the base model specified with formula
:
typically the intercept model except
when the user requires a number of terms to be forced in the final model.
The search for the 'best' estimator is limited by four user-specified
arguments: maxsize
, maxorderint
, maxsumofpow
and rank.cutoffs
. The first argument
limits the maximum number of terms in the models considered (excluding
the intercept). The second
argument limits the maximum order of interactions for the models
considered. All terms in the models considered are composed of interactions of
variables raised to a given power. The third argument limits the
maximum sum of powers in each term. The fourth argument limits the set of candidate variables to
be considered in each model. Only the variables whose ranks specified
by candidate.rank
are smaller than the threshold(s) specified by
rank.cutoffs
are allowed in the models considered (and, if applicable, the variable(s) forced in the model). Note that
the default ranking of the candidate variables in data
is based
on their univariate association with the independent variable(s) and
obtained with the routine candidateReduction
.
This data-adaptive estimation procedure allows comparison of models
based on different number of observations and can account for
informative censoring through the use of weights in each
regression. These weights are provided to the DSA routine with the
argument weights
.
The DSA routine currently supports data-adaptive estimation
for continuous or binomial outcomes. When the outcome is binomial, the estimators considered are
based on polynomial generalized linear models where the link function is the
logit function. Factors can be candidate variables with the caveat
that there are currently limitations
to the use of factors in terms forced in the final model (see formula
above).
The default cross-validation splitting scheme is v-fold
where the value for v is specified with the argument vfold
. The DSA
routine performs the data splits based on the value for
vfold
, nsplits
, id
and the userseed
arguments. The argument nsplits
specifies the number of v-fold
splits, e.g. if nsplits=2
and vfold=5
then the data is split twice based on
the 5-fold splitting scheme and thus the DSA call relies on 10 data splits. The
argument id
identifies the independent experimental units in the data and ensures
that the training and validation sets are independent.
The argument userseed
is
used to set the seed of the R random number generators with the routine
set.seed()
. This allows for reproducible results. The user can specify
an alternative cross-validation splitting scheme with the argument
usersplits
.
The DSA employs two different tolerance values: 1) to compare whether two number are different (by default set to .Machine$eps*100) and 2) to evaluate convergence or rank defficiency when fitting models (by default set to 1e-8). Both defaults can be changed using similar calls to the C methods in zzz.R.
If model.matrices=TRUE
then a list of two objects is returned, X
and Y
, corresponding respectively with the
matrices of all candidate variables and outcome(s) which would be
considered (pre-dimension reduction) by the data-adaptive estimation procedure if
model.matrices
were set to FALSE
(default).
Otherwise, an object of class 'DSA' is returned. The traditional print
,
summary
, predict
, and plot
methods are then available on the returned
object. Objects of class 'DSA' have the following attributes:
average.CVrisks |
a matrix or tridimensional array containing the average cross-validated risks
defined by the L2 loss function over the splits of the data for estimators (models) indexed by
1) a size ranging from 1 to maxsize+1 (intercept included);
2) a maximum order of interactions ranging from 1 to maxorderint ; and
3) by the dimension reduction levels specified in rank.cutoffs .
The average cross-validated risk with coordinates (i,j,k) corresponds to the average cross-validated risk
for estimators (models) indexed by a maximum order of interactions i, size j (intercept included), and a rank cut-off in rank.cutoffs .
Note that the average cross-validated risk for a given size, order of interactions and dimension reduction level is the average across data splits of the
average risks computed on the validation sets. The average
risks on a given validation set corresponds to the residuals sum of
squares (missing residuals excluded) divided by the
number of (complete) observations in the validation set. |
min.risk |
the coordinates identifying the minimum average cross-validated risk in
average.CVrisks . Note that the corresponding average cross-validated
risk does not typically correspond to the average cross-validated risk for the final
model selected by the DSA routine. |
final.cutoff |
the 'best' or user-specified dimension reduction level corresponding with the model selected. |
model.selected |
a formula representing the final model selected
by the DSA routine. The variables in model.selected
correspond to variables
in data . |
family |
a description of the link function for the final
generalized linear model selected: gaussian indicates
the indentity function while binomial indicates the logit function. |
coefficients |
the coefficients of the final model fitted on the learning set. |
average.CVrisk.model.selected |
the average cross-validated risk
for model.selected . |
models.allsizes |
list of the 'best' models of each size considered on the learning set and corresponding with the 'optimum' order of interaction and dimension reduction level selected by cross-validation. |
candidate.vars |
list of the candidate variables truly considered in
the data-adaptive estimation procedure (post-dimension reduction).
Note that this list does not necessarily contain all candidate variables in data but only the variables whose ranks are lower than or equal to the largest cut-off
value in rank.cutoffs . Note that this list includes the variables forced in the final
model, if applicable. |
selected.vars |
list of the variables corresponding with the
'best' dimension reduction level, i.e. all variables whose ranks are
lower than or equal to the 'best' cut-off value selected by
cross-validation. Note that this list includes the variables forced in the final
model, if applicable. If rank.cutoffs contains only one cut-off value then selected.vars is equal to candidate.vars. |
candidate.rank |
the original argument candidate.rank
passed in to the DSA routine or, if missing, the default ranking of all variables in data obtained with candidateReduction .
|
rank.cutoffs |
the original argument rank.cutoff
passed in to the DSA routine or, if missing, its default value of 1.
|
computing.time |
an object of class difftime which contains the
information about the computing time associated with this DSA object. |
call |
the DSA call which generated this DSA object. |
formula |
the original base formula passed in to the DSA routine. |
data |
the original data frame passed in to the DSA routine. |
id |
the value of id used in the DSA call. |
userseed |
NULL if the argument usersplits was provided by the user or
if the user did not provide userseed , otherwise equal to userseed . |
splits |
a matrix or a list of matrices that identifies how
the data were split, i.e. the usersplits matrix if provided by the user or
the corresponding matrix (when nsplits = 1 ) or list of matrices
(when nsplits > 1 ) randomely generated by the DSA routine. |
weights |
the weights used in the original call. |
The computing time for a DSA call can be very large depending on the input arguments which define the intensity of the estimator selection. DSA calls are more computing intensive with binomial outcomes. It is often recommended to run this routine in BATCH mode.
Romain Neugebauer and James Bullard based on the original C code from Sandra Sinisi.
1. Mark J. van der Laan and Sandrine Dudoit, "Unified Cross-Validation Methodology For Selection Among Estimators and a General Cross-Validated Adaptive Epsilon-Net Estimator: Finite Sample Oracle Inequalities and Examples" (November 2003). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 130. http://www.bepress.com/ucbbiostat/paper130
2. Mark J. van der Laan, Sandrine Dudoit, and Aad W. van der Vaart, "The Cross-Validated Adaptive Epsilon-Net Estimator" (February 2004). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 142. http://www.bepress.com/ucbbiostat/paper142
3. Sandra E. Sinisi and Mark J. van der Laan, "Loss-Based Cross-Validated Deletion/Substitution/Addition Algorithms in Estimation" (March 2004). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 143. http://www.bepress.com/ucbbiostat/paper143
I
, set.seed
,
difftime
, setDSAMessageLevel
,
getDSAMessageLevel
, crossValidate
,
candidateReduction
, DSAgenerate
,
summary.DSA
,coefficients.DSA
,
residuals.DSA
, predict.DSA
,
plot.DSA
.
library(DSA) ## an example using the state census data. (gaussian) data(state) state.data <- as.data.frame(state.x77) colnames(state.data) <- unlist(lapply(strsplit(colnames(state.data), " "), function(x) paste(x, collapse = ""))) res <- DSA(Murder ~ 1, data = state.data, maxsize = 5, maxsumofpow = 2, maxorderint = 2) residuals(res) coefficients(res) summary(res) res <- DSA(Murder ~ 1, data = state.data, maxsize = 5, maxsumofpow = 2, maxorderint = 2,nsplits=10,silent=FALSE) residuals(res) coefficients(res) summary(res) ## an example using weights (gaussian). ddir <- paste(system.file(package = "DSA"), "/", "data", "/", sep = "") x <- read.table(paste(ddir, "smalllymphRoX.txt", sep = ""), nrow=240) y <- read.table(paste(ddir, "logT1lymph.txt", sep = "")) weights <- as.matrix(read.table(paste(ddir, "lymphWeights_111505", sep = ""))) colnames(y) <- "loglymph" data <- cbind(x, y) res <- DSA(loglymph ~ 1, data = data, maxsize = 5, maxorderint = 2, maxsumofpow = 2, weights = weights) plot(res) summary(res) ## an example using simulated data (binomial, logistic link). n <- 1000 W <- cbind(rnorm(n), rnorm(n) < 1, rnorm(n) < 2, rnorm(n, 2, 4), runif(n), rcauchy(n), rlogis(n) < .1, rnorm(n) < .1, rnorm(n, 120, 10), rnorm(n, 66, 2)) Y <- 22 + .5*W[,1] + .02*W[,1]^2 + .01*W[,1]*W[,2] + 2*W[,3] + .7*W[,4]^2 Y <- as.matrix(as.integer(Y - mean(Y))/sd(Y)) Y <- as.matrix(as.integer(Y < rlogis(n))) colnames(W) <- paste("V", 1:ncol(W), sep = "") colnames(Y) <- "Y" data <- as.data.frame(cbind(W, Y)) res <- DSA(Y ~ 1, data = data, family = binomial, maxsize = 8, maxorderint = 2, maxsumofpow = 3) plot(res) summary(res) ## an example with factors. n <- 1000 inc <- round(runif(n,1,4)) edu <- round(runif(n,1,4)) W <- data.frame(income = as.factor(c("10000", "10000.35000", "35000.100000", "100000")[inc]), age = round(runif(n, 20, 60)), weight = rnorm(n, 140, 30), degree = as.factor(c("highschool", "highschool.grad", "college", "post.graduate")[edu]), male = rbinom(10, size = 1, prob = .5)) Vertical.Leap <- 8 + ifelse(inc == 1, 2, 0) + (60 - W$age) * .1 + W$male * 4 + rnorm(n, 0, 4) data <- cbind(W, Vertical.Leap) res <- DSA(Vertical.Leap ~ 1, data = data, maxsize = 8, maxorderint = 2, maxsumofpow = 2) plot(res, plot.compare = TRUE) summary(res) ## now add some forced terms. res <- DSA(Vertical.Leap ~ degree, data = data, maxsize = 8, maxorderint = 2, maxsumofpow = 2) plot(res, plot.compare = TRUE) summary(res) ## ## an example using binomial - two column outcome and forced terms ## n <- 1000 W <- cbind(rnorm(n), rnorm(n) < 1, rnorm(n) < 2, rnorm(n, 2, 4), runif(n), rcauchy(n), rlogis(n) < .1, rnorm(n) < .1, rnorm(n, 120, 10), rnorm(n, 66, 2)) Y <- 10 + .5*W[,1] + .02*W[,1]^2 + .01*W[,1]*W[,2] + 2*W[,3] + .7*W[,4]^2 Y <- as.matrix(as.integer(Y - mean(Y))/sd(Y)) trials <- rpois(n, lambda = 20) successes <- sapply(1:n, function(i) { rbinom(1, size = trials[i], prob = pnorm(Y[i])) }) failures <- trials - successes colnames(W) <- paste("V", 1:ncol(W), sep = "") data <- as.data.frame(cbind(W, "successes" = successes, "failures" = failures)) res <- DSA(cbind(successes, failures) ~ 1, data = data, family = binomial, maxsize = 8, maxorderint = 2, maxsumofpow = 3) summary(res) plot(res) ## now add some forced terms res <- DSA(cbind(successes, failures) ~ V1 + I(V4^2), data = data, family = binomial, maxsize = 8, maxorderint = 2, maxsumofpow = 3) summary(res) plot(res) ## with user-specified dimension reduction res <- DSA(cbind(successes, failures) ~ V1, data=data,family=binomial, rank.cutoffs=0.1, maxsize = 5, maxorderint = 1, maxsumofpow = 2,userseed=29,nsplits=2,silent=FALSE) summary(res) res$candidate.vars res$selected.vars ## with dimension reduction based on cross-validation res <- DSA(cbind(successes, failures) ~ V1, data=data,family=binomial, rank.cutoffs=c(0.1,0.2,0.6), maxsize = 5, maxorderint = 1, maxsumofpow = 2,userseed=29) summary(res) res$candidate.vars res$selected.vars