DSAgenerate {DSA}    R Documentation

D/S/A algorithm to generate models without cross-validation

Description

Generates models with the Deletion/Substitution/Addition (D/S/A) algorithm under user-specified constraints.

Usage

DSAgenerate(formula, data, family = gaussian, weights = NULL, maxsize,
orderint, maxsumofpow, silent = TRUE, ...)

Arguments

formula a symbolic description of the base model which specifies the dependent/response variable(s) and all terms forced in the models returned. Typically, formula is set to Y ~ 1 when no terms are forced in the models returned. Currently supported outcomes are continuous or binomial (binary with 0s and 1s, or a two-column matrix with successes in the first column and failures in the second). Independent/explanatory candidate variables can be continuous or categorical (factors). Only polynomial main terms and polynomial interaction terms specified within the I() subroutine can be forced in the models returned, i.e. interactions between a factor and other variables are currently not supported.
data a required data frame containing the response variable(s) and the candidate covariates to be considered in the model search.
family currently either binomial, multinomial or gaussian. Used to determine whether logistic, multinomial (logit link only) or general linear models should be considered.
weights a vector of real numbers whose number of elements is the number of observations. This vector contains the weights to be applied to each observation in the learning set (data). The argument weights is ignored if the value of weights is NULL (default).
maxsize a strictly positive integer which limits the model search to candidate linear models with a number of terms (excluding the intercept) less than or equal to maxsize. The argument maxsize must be greater than or equal to the number of terms forced into the models through formula.
orderint a strictly positive integer which limits the model search to candidate linear models with a maximum order of interactions equal to orderint. The argument orderint must be 1 if the number of terms forced in the models is equal to maxsize.
maxsumofpow an integer greater than or equal to orderint which limits the model search to candidate generalized linear models whose terms involve variables with powers summing to a value less than or equal to maxsumofpow. The argument maxsumofpow must be 1 if the number of terms forced in the models is equal to maxsize.
silent if FALSE, intermediate messages are printed to standard output to show the progress of the computations (message level set to 0). If TRUE, no intermediate messages are printed to standard output (message level set to -1). The message level can be controlled further by calling the setDSAMessageLevel subroutine before a call to the DSA routine in which the argument silent is not specified.
... currently used internally to recursively call the DSA.
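
As an illustration of the formula conventions described above (a two-column binomial outcome and polynomial terms wrapped in I()), the following sketch fits a single model with base R's glm. It does not call the DSA package; the simulated data and the variable names V1, V4, successes and failures are assumptions made for the example only.

```r
# Base R illustration (stats::glm); this is not a call into the DSA package.
set.seed(1)
n <- 200
d <- data.frame(V1 = rnorm(n), V4 = rnorm(n))
trials <- rpois(n, lambda = 20) + 1L        # at least one trial per row
p <- plogis(0.3 + 0.5 * d$V1 - 0.2 * d$V4^2)
d$successes <- rbinom(n, size = trials, prob = p)
d$failures  <- trials - d$successes

# Two-column outcome: successes first, failures second.
# Polynomial terms are written inside I() so that ^ is read arithmetically.
fit <- glm(cbind(successes, failures) ~ V1 + I(V4^2),
           family = binomial, data = d)
coef_names <- names(coef(fit))
print(coef_names)
```

The same formula syntax is what the formula argument expects when terms are forced into the models returned.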

Details

The DSAgenerate routine implements the Deletion/Substitution/Addition (D/S/A) algorithm to generate candidate estimators defined with polynomial generalized linear models. Unlike the DSA routine, DSAgenerate does not perform data-adaptive estimation based on cross-validation. Instead, DSAgenerate returns a list of estimators defined by models of size 1 to maxsize such that each estimator minimizes the empirical risk (not the cross-validated risk) among all estimators of the same size considered. The space of considered candidate estimators is parameterized with three variables: maxsize, orderint and maxsumofpow.
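
The idea of keeping, for each size, the model with the smallest empirical risk can be sketched in base R. The code below is not the D/S/A search: as a stand-in it enumerates every subset of a small, hand-picked pool of terms and uses the mean squared residual of lm as the empirical risk. The pool candidate_terms, the helper empirical_risk and the simulated data are assumptions for illustration.

```r
# Brute-force stand-in for "best model of each size by empirical risk".
# The real D/S/A search does not enumerate all subsets.
set.seed(2)
n <- 300
d <- data.frame(V1 = rnorm(n), V2 = rnorm(n), V3 = rnorm(n))
d$Y <- 1 + 2 * d$V1 - 1.5 * d$V2^2 + rnorm(n, sd = 0.1)

candidate_terms <- c("V1", "V2", "V3", "I(V1^2)", "I(V2^2)", "I(V1*V2)")
maxsize <- 3

# Empirical risk under squared-error loss: mean squared residual on the
# same data used for fitting (no cross-validation).
empirical_risk <- function(terms) {
  fit <- lm(reformulate(terms, response = "Y"), data = d)
  mean(residuals(fit)^2)
}

# For each size k, keep the subset of k terms with the smallest risk.
best_by_size <- lapply(seq_len(maxsize), function(k) {
  subsets <- combn(candidate_terms, k, simplify = FALSE)
  risks <- vapply(subsets, empirical_risk, numeric(1))
  list(terms = subsets[[which.min(risks)]], risk = min(risks))
})

for (k in seq_len(maxsize)) {
  cat("size", k, ":", paste(best_by_size[[k]]$terms, collapse = " + "), "\n")
}
```

Because the risk is computed on the fitting data, it can only decrease or stay equal as the size grows, which is why a list of best models per size, rather than a single selected model, is the natural output when no cross-validation is performed.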

The D/S/A algorithm is an aggressive model search algorithm which iteratively generates polynomial generalized linear models based on the existing terms in the current 'best' model and the following three steps: 1) a deletion step which removes a term from the model, 2) a substitution step which replaces one term with another, and 3) an addition step which adds a term to the model. The search for the 'best' estimators starts with the base model specified with formula: typically the intercept model except when the user requires a number of terms to be forced in the models returned.
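
The three moves can be sketched as operations on a vector of term labels. This is a simplified illustration, not the package's implementation: here substitutions and additions draw from a fixed pool of terms, whereas the algorithm described above derives new terms from those already in the current model. The function dsa_moves and the pool are hypothetical names introduced for the example.

```r
# Generate the neighbours of a model under deletion, substitution and
# addition moves. 'current' and 'pool' are character vectors of term labels.
dsa_moves <- function(current, pool) {
  unused <- setdiff(pool, current)

  # Deletion: drop one term at a time.
  deletions <- lapply(seq_along(current), function(i) current[-i])

  # Substitution: replace one term with one unused term.
  substitutions <- unlist(lapply(seq_along(current), function(i) {
    lapply(unused, function(new_term) replace(current, i, new_term))
  }), recursive = FALSE)

  # Addition: append one unused term.
  additions <- lapply(unused, function(new_term) c(current, new_term))

  list(deletion = deletions, substitution = substitutions, addition = additions)
}

pool  <- c("V1", "V2", "I(V1^2)", "I(V1*V2)")
moves <- dsa_moves(c("V1", "V2"), pool)
print(lengths(moves))
```

Each candidate produced by a move would then be fitted and compared on its empirical risk against the current 'best' model of the same size.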

The search for the 'best' estimators is limited by three user-specified arguments: maxsize, orderint and maxsumofpow. The first argument limits the maximum number of terms in the models considered (excluding the intercept). The second argument limits the maximum order of interactions for the models considered. All terms in the models considered are composed of interactions of variables raised to a given power. The third argument limits the maximum sum of powers in each term.
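
A minimal sketch of how the three constraints restrict the model space is given below. It assumes that the order of interaction of a term is the number of distinct variables it involves, and it encodes a term as a named vector of integer powers; both the encoding and the helper names term_is_admissible and model_is_admissible are assumptions for illustration and are not part of the DSA package.

```r
# A term is encoded as a named vector of non-negative integer powers,
# e.g. c(V1 = 2, V2 = 1) stands for V1^2 * V2.
term_is_admissible <- function(powers, orderint, maxsumofpow) {
  n_vars <- sum(powers > 0)              # assumed order of interaction
  n_vars >= 1 && n_vars <= orderint && sum(powers) <= maxsumofpow
}

# A model is a list of terms (the intercept is not counted).
model_is_admissible <- function(terms, maxsize, orderint, maxsumofpow) {
  length(terms) <= maxsize &&
    all(vapply(terms, term_is_admissible, logical(1),
               orderint = orderint, maxsumofpow = maxsumofpow))
}

# With orderint = 2 and maxsumofpow = 3:
term_is_admissible(c(V1 = 2, V2 = 1), 2, 3)          # TRUE:  V1^2 * V2
term_is_admissible(c(V1 = 2, V2 = 2), 2, 3)          # FALSE: powers sum to 4
term_is_admissible(c(V1 = 1, V2 = 1, V3 = 1), 2, 3)  # FALSE: three variables

model <- list(c(V1 = 1), c(V4 = 2), c(V1 = 1, V2 = 1))
model_ok <- model_is_admissible(model, maxsize = 8, orderint = 2, maxsumofpow = 3)
print(model_ok)
```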

Value

a list of four objects is returned:

models.allsizes a list of the 'best' models for each size considered, corresponding to an order of interaction equal to orderint.
family a description of the link function defining the candidate models considered: gaussian indicates the identity function while binomial and multinomial indicate the logit function.
computing.time an object of class difftime which contains the information about the computing time associated with this call to DSAgenerate.
call the DSAgenerate call which generated this object.
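
Since computing.time is an object of class difftime, it can be handled with the usual base R tools. The sketch below builds a difftime by hand rather than from a DSAgenerate result; the variable names are assumptions for the example.

```r
# Build a difftime the usual way: the difference of two Sys.time() values.
start <- Sys.time()
invisible(sum(sqrt(seq_len(1e5))))   # placeholder for some computation
elapsed <- Sys.time() - start

# Convert to a plain number in a chosen unit for reporting or comparison.
elapsed_secs <- as.numeric(elapsed, units = "secs")
print(class(elapsed))
```

The same conversion would apply to the computing.time component of the returned list.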

Author(s)

Romain Neugebauer and James Bullard based on the original C code from Sandra Sinisi.

References

1. Sandra E. Sinisi and Mark J. van der Laan, "Loss-Based Cross-Validated Deletion/Substitution/Addition Algorithms in Estimation" (March 2004). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 143. http://www.bepress.com/ucbbiostat/paper143

See Also

DSA, setDSAMessageLevel, getDSAMessageLevel, difftime.

Examples

library(DSA)

##
## an example using binomial - two column outcome and forced terms
##
n <- 1000
W <- cbind(rnorm(n), rnorm(n) < 1, rnorm(n) < 2, rnorm(n, 2, 4), runif(n),
           rcauchy(n), rlogis(n) < .1, rnorm(n) < .1, rnorm(n, 120, 10),
           rnorm(n, 66, 2))

Y <- 10 + .5*W[,1] + .02*W[,1]^2 + .01*W[,1]*W[,2] + 2*W[,3] + .7*W[,4]^2
Y <- as.matrix(as.integer(Y - mean(Y))/sd(Y))
trials <- rpois(n, lambda = 20)
successes <- sapply(1:n, function(i) {
  rbinom(1, size = trials[i], prob = pnorm(Y[i]))
})
failures <- trials - successes
colnames(W) <- paste("V", 1:ncol(W), sep = "")
data <- as.data.frame(cbind(W, "successes" = successes, "failures" = failures))

res <- DSA(cbind(successes, failures) ~ 1, data = data, family = binomial, maxsize = 8,
           maxorderint = 2, maxsumofpow = 3)  
summary(res)
plot(res)

## one may prefer to use a model with an order of interaction equal to 2
res <- DSAgenerate(cbind(successes, failures) ~ 1, data = data, family = binomial, maxsize = 8,
           orderint = 2, maxsumofpow = 3)  
res

## now add some forced terms
res <- DSAgenerate(cbind(successes, failures) ~ V1 + I(V4^2), data = data, family = binomial, maxsize = 8,
           orderint = 2, maxsumofpow = 3) 
res

[Package DSA version 3.1 Index]