Bin Yu

2. INFORMATION THEORY AND STATISTICS

2.1 Minimum Description Length Principle

The Minimum Description Length (MDL) principle is an approach to statistical modeling built on a source-coding framework. It continues the tradition of interaction between statistics and information theory, a tradition exemplified by the early book by Kullback in 1959, the recent book by Cover and Thomas in 1991, and the works of Csisz\'{a}r and others on I-divergence, a measure that finds many applications in statistics. As a general principle for modeling, MDL has been recognized and used by researchers facing modern stochastic modeling problems such as model selection in regression and time series, cluster analysis, image segmentation and compression, wavelet signal processing, and neural nets. The MDL idea, or shortest description idea, is very natural in statistical classification problems, as in the work of Wallace and Boulton in 1968. It has also been applied to order selection problems in time series as a useful alternative to AIC and BIC (in fact, the two-stage form of MDL model selection coincides with BIC). Another form of MDL model selection, predictive least squares, is a natural and valid empirical prediction error assessment and has been well studied in the time series context by several researchers, including Hannan, Rissanen, Guo and Huang, Wei, and Lai; a schematic display of the two-stage and predictive forms is given at the end of this subsection.

My work on MDL has focused on contrasting MDL with other methods in terms of conventional statistical criteria. In one such paper, asymptotic expansions of AIC and of the three forms of MDL model selection criteria are derived for finite-dimensional normal regression models; it is also shown there that all three MDL criteria, including BIC, are consistent and prediction-optimal, while AIC is not prediction-optimal (and is inconsistent, as shown earlier by Shibata). A useful observation emanating from this work is that neither MDL (or BIC) nor AIC is a clear-cut superior method; which performs better depends on the bias-variance tradeoff in the model.

In some of the early works on MDL in a nonparametric framework, we obtained two kinds of results, reflecting the interaction of ideas and techniques from density estimation and information theory. One is a coding result: we extended Rissanen's universal coding lower bounds on redundancy from parametric families to the Lipschitz smooth density class by using Assouad's technique from density estimation, and we gave a predictive coding scheme that achieves the rate of the coding lower bound both in expectation and in an almost sure sense. The other is a density estimation result: we assessed the MDL-based histogram density estimator in that class (a schematic two-part histogram criterion is sketched below); this histogram estimator turns out to carry an extra $\log n$ factor when compared with the optimal rate. Lower bounds for other smooth classes, and further developments, are given in related papers. Minimax lower bounds for smooth continuous Markov source classes are technically much more challenging to obtain than in the iid case; Yu (1994b) gives a minimax redundancy lower bound in the Markov case via a recursive equation. Based on mutual information calculations, Yu (1996) derives minimax redundancy lower bounds for smooth density classes and hence unifies the minimax approaches to redundancy lower bounds in the parametric and nonparametric cases. Yu (1994c) explores connections between important inequalities in statistics and information theory. Rissanen and Yu (1996) study MDL in the computational learning context. Last, we showed, for the first time, that the extra $\log n$ factor is not necessary for MDL-based density estimators.
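To fix ideas, the display below is a minimal sketch of a two-part MDL criterion for histogram density estimation, under illustrative assumptions that are not the exact construction of the work described above: data $x_1, \dots, x_n$ on $[0,1]$, candidate histograms with $m$ equal-width bins, $\hat p_m$ the maximum likelihood histogram with $m$ bins, and a $\frac{m}{2}\log n$ description cost for the fitted bin probabilities. The number of bins is chosen by minimizing the total two-part code length,
\[
\hat m \;=\; \arg\min_{m}\Big\{\, \frac{m}{2}\log n \;-\; \sum_{i=1}^{n}\log \hat p_m(x_i) \,\Big\},
\]
and the resulting density estimator is $\hat p_{\hat m}$. The first term is the cost of describing the histogram parameters and the second is the cost of describing the data given the fitted histogram; how the parameters are coded, and hence what the first term looks like, is exactly where refinements of this baseline enter.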
The extra $\log n$ factor is removed, in particular, by an optimal-rate MDL-based histogram estimator we constructed that takes the Lipschitz condition into account in the coding of the histogram parameters.
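As a companion to the regression discussion earlier in this subsection, the display below sketches the two-stage and predictive least squares forms of MDL model selection under illustrative assumptions, not the exact formulations of the papers mentioned above: a candidate model with $k$ free parameters and maximum likelihood estimate $\hat\theta_k$ based on $x_1,\dots,x_n$, and, for the predictive form, regression data $(x_t, y_t)$ with $\hat\beta_{t-1,k}$ the least squares fit of the $k$-parameter model to the first $t-1$ observations (with $t_0$ an initial segment used to start the fits; constants and initializations are otherwise omitted):
\[
\mbox{two-stage MDL}(k) \;=\; -\log p_{\hat\theta_k}(x_1,\dots,x_n) \;+\; \frac{k}{2}\log n,
\qquad
\mathrm{PLS}(k) \;=\; \sum_{t=t_0+1}^{n}\big(y_t - x_t^{\top}\hat\beta_{t-1,k}\big)^2 .
\]
Minimizing the first criterion over candidate models is equivalent to BIC, which is the sense in which the two-stage form of MDL model selection coincides with BIC; the second criterion accumulates one-step prediction errors in which each $y_t$ is predicted before it is used in any fitting, which is why predictive least squares can be read as an honest empirical prediction error assessment.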