This page contains pointers to a collection of optional readings, in case you wish to delve further into the topics covered in lectures.

`A Probabilistic Theory of Pattern Recognition.' L. Devroye, L. Gyorfi and G. Lugosi. Springer, New York, 1996.

`Statistical Learning Theory.' Vladimir N. Vapnik. Wiley, 1998.

`Neural Network Learning: Theoretical Foundations.' Martin Anthony and Peter L. Bartlett. Cambridge University Press, 1999.

`An Elementary Introduction to Statistical Learning Theory.' Sanjeev Kulkarni and Gilbert Harman. Wiley, 2011.

`Foundations of Machine Learning.' Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar. MIT Press, 2012.

This textbook describes game-theoretic formulations of prediction problems:

`Prediction, Learning, and Games.' N. Cesa-Bianchi and G. Lugosi, Cambridge University Press, 2006.

See also the following review papers.

`Theory of Classification: a Survey of Some Recent Advances.' Stephane Boucheron, Olivier Bousquet and Gabor Lugosi.

`Online learning and online convex optimization.' Shai Shalev-Shwartz.

`A General Lower Bound on the Number of Examples Needed for Learning.' A. Ehrenfeucht, D. Haussler, M. Kearns and L. Valiant.

The following are 1987 and 1990 revisions of older books (first published in 1969 and 1965, respectively) on linear threshold functions, the perceptron algorithm, and the perceptron convergence theorem.

`Perceptrons: An Introduction to Computational Geometry.' Marvin L. Minsky and Seymour A. Papert. MIT Press, 1987.

`The Mathematical Foundations of Learning Machines.' N. Nilsson. Morgan Kaufmann, San Francisco, 1990.

The upper bound on risk for the perceptron algorithm that we saw in lectures follows from the perceptron convergence theorem, together with results converting mistake-bounded algorithms to average risk bounds. The following paper reviews these results.

`Large margin classification using the perceptron algorithm.' Yoav Freund and Robert E. Schapire.
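As a quick reminder of the algorithm these bounds apply to, here is a minimal sketch of the perceptron in plain Python (the function and variable names are illustrative, not taken from any of the texts above):

```python
def perceptron(X, y, max_passes=100):
    """Run the perceptron on examples X (lists of floats) with labels y in {-1, +1}.

    Returns the final weight vector and the total number of mistakes.
    When the data are linearly separable with margin gamma inside a ball
    of radius R, the convergence theorem bounds the mistakes by (R/gamma)**2.
    """
    d = len(X[0])
    w = [0.0] * d
    mistakes = 0
    for _ in range(max_passes):
        clean_pass = True
        for xi, yi in zip(X, y):
            if yi * sum(wj * xj for wj, xj in zip(w, xi)) <= 0:
                # Mistake: move the weights towards the misclassified example.
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                mistakes += 1
                clean_pass = False
        if clean_pass:  # A full pass with no mistakes: a separator has been found.
            break
    return w, mistakes
```

On separable data the loop terminates after a clean pass; on non-separable data it simply stops after `max_passes` passes.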

Metric Characterization of Random Variables and Random Processes. V. Buldygin and I. Kozachenko. American Mathematical Society, 2000.

Concentration Inequalities: A Nonasymptotic Theory of Independence. S. Boucheron, G. Lugosi and P. Massart. Oxford University Press, 2013.

See also this paper.

`Concentration inequalities.' S. Boucheron, O. Bousquet and G. Lugosi.

The Hoeffding-Azuma inequality:

`Weighted sums of certain dependent random variables.' Kazuoki Azuma. 1967.

`Probability inequalities for sums of bounded random variables.' Wassily Hoeffding. 1963.

`On the method of bounded differences.' Colin McDiarmid. 1989.
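For orientation, a commonly used form of the Hoeffding-Azuma inequality is the following (stated informally here; see the cited papers for the precise conditions and constants). If $V_1, \dots, V_n$ is a martingale difference sequence with $|V_i| \le c_i$ almost surely, then

```latex
\[
  \Pr\!\left( \sum_{i=1}^{n} V_i \ge t \right)
  \;\le\;
  \exp\!\left( \frac{-t^2}{2 \sum_{i=1}^{n} c_i^2} \right)
  \qquad \text{for all } t > 0 .
\]
```

McDiarmid's bounded differences inequality follows by applying this to the Doob martingale of a function with bounded coordinate-wise variation.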

`Rademacher and Gaussian complexities: risk bounds and structural results' P. L. Bartlett and S. Mendelson.

`Model selection and error estimation.' P. Bartlett, S. Boucheron and G. Lugosi.

`Rademacher penalties and structural risk minimization.' Vladimir Koltchinskii.

Rademacher averages for large margin classifiers:

`Empirical margin distributions and bounding the generalization error of combined classifiers.' Vladimir Koltchinskii and Dmitriy Panchenko.

The `finite lemma' (Rademacher averages of finite sets) is Lemma 5.2 in this paper, which also introduces local Rademacher averages.

`Some applications of concentration inequalities to statistics.' Pascal Massart.
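The finite lemma is often stated as follows (informally reproduced here; the cited paper gives the precise version). For a finite set $A \subset \mathbb{R}^n$ and independent Rademacher variables $\sigma_1, \dots, \sigma_n$,

```latex
\[
  \mathbb{E}\,\sup_{a \in A} \sum_{i=1}^{n} \sigma_i a_i
  \;\le\;
  \max_{a \in A} \|a\|_2 \,\sqrt{2 \ln |A|} .
\]
```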

The contraction inequality in Lecture 8 is Corollary 3.17 in this book.

Probability in Banach Spaces: Isoperimetry and Processes. M. Ledoux and M. Talagrand. Springer, 1991.
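A commonly used consequence of the contraction inequality, as typically stated in the learning-theory literature (the book's statement is more general), is: for $T \subset \mathbb{R}^n$ and $1$-Lipschitz functions $\varphi_i : \mathbb{R} \to \mathbb{R}$,

```latex
\[
  \mathbb{E}\,\sup_{t \in T} \sum_{i=1}^{n} \sigma_i \varphi_i(t_i)
  \;\le\;
  \mathbb{E}\,\sup_{t \in T} \sum_{i=1}^{n} \sigma_i t_i ,
\]
```

where the $\sigma_i$ are independent Rademacher variables. In particular, composing a function class with a Lipschitz loss does not increase its Rademacher complexity by more than the Lipschitz constant.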

The growth function, VC-dimension, and pseudodimension are described in the following text (see Chapter 3). Estimates of these quantities for parameterized function classes are covered in Chapters 7 and 8.

`Neural Network Learning: Theoretical Foundations.' Martin Anthony and Peter Bartlett. Cambridge University Press, 1999.

- `Aggregating strategies.' V. Vovk.
- `How to use expert advice.' N. Cesa-Bianchi, Y. Freund, D.P. Helmbold, D. Haussler, R. Schapire, and M.K. Warmuth.
- `On prediction of individual sequences.' N. Cesa-Bianchi and G. Lugosi.

- `Prediction, learning, and games.' N. Cesa-Bianchi and G. Lugosi. Cambridge University Press, 2006.

- `When Random Play is Optimal Against an Adversary.' Jacob Abernethy, Manfred K. Warmuth and Joel Yellin. (COLT 2008.)

- `Efficient Algorithms for On-line Optimization.' Adam Tauman Kalai and Santosh Vempala. (Journal of Computer and System Sciences 71(3): 291-307, 2005.)
- `Adaptive Online Prediction by Following the Perturbed Leader.' Marcus Hutter and Jan Poland. (Journal of Machine Learning Research 6: 639-660, 2005.)

- `Tracking the best expert.' Mark Herbster and Manfred K. Warmuth (Machine Learning 32(2), 151-178, 1998.)
- `Efficient learning algorithms for changing environments.' Elad Hazan and C. Seshadhri (ICML p. 50, 2009.)
- `A closer look at adaptive regret.' Dmitry Adamskiy, Wouter M. Koolen, Alexey Chernov, and Vladimir Vovk. (ALT 290-304, 2012.)

- `Universal sequential coding of single messages.' Yuri M. Shtarkov (Problems of Information Transmission, vol. 23, no. 3, pp. 175-186, 1987)
- Section 9.4 of `Prediction, learning, and games.' N. Cesa-Bianchi and G. Lugosi. Cambridge University Press, 2006.

- `Universal portfolios.' T. M. Cover (Mathematical Finance, vol. 1, no. 1, pp. 1-29, 1991.)
- Chapter 10 of `Prediction, learning, and games.' N. Cesa-Bianchi and G. Lugosi. Cambridge University Press, 2006.
- `Algorithms for Portfolio Management based on the Newton Method.' Amit Agarwal and Elad Hazan and Satyen Kale and Robert E. Schapire. (Proceedings of the 23rd ICML, 2006.)

- `The context-tree weighting method: Basic properties.' Frans M. J. Willems, Yuri M. Shtarkov, and Tjalling J. Tjalkens. (IEEE Transactions on Information Theory, pp 653-664, 1995.)
- The CTW website http://www.ele.tue.nl/ctw/
- `Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition.' Ron Begleiter and Ran El-Yaniv. (Journal of Machine Learning Research 7 pp 379-411, 2006.)

- `Aggregating strategies.' V. Vovk.
- `Mixability is Bayes Risk Curvature Relative to Log Loss.' Tim van Erven, Mark D. Reid and Robert C. Williamson. (Journal of Machine Learning Research, vol. 13, pp. 1639-1663, 2012.)

- `Using and combining predictors that specialize.' Yoav Freund, Robert E. Schapire, Yoram Singer, and Manfred K. Warmuth. (STOC pp 334-343, 1997.)
- `Prediction with expert evaluators' advice.' Alexey Chernov and Vladimir Vovk. (ALT pp 8-22, 2009.)
- `Putting Bayes to sleep.' Wouter M. Koolen, Dmitri Adamskiy, and Manfred K. Warmuth. (NIPS 25, pp 135-143, 2012.)

The treatment in lectures draws heavily on Chapter 11 of `Prediction, learning, and games.' N. Cesa-Bianchi and G. Lugosi. Cambridge University Press, 2006.

Lecture notes from an earlier version of this course:

- Lecture notes on online learning. A. Rakhlin.

- `Online learning and online convex optimization.' S. Shalev-Shwartz.

- `Logarithmic regret algorithms for online convex optimization' E. Hazan, A. Agarwal, and S. Kale.

- `On the generalization ability of on-line learning algorithms.' N. Cesa-Bianchi, A. Conconi, and C. Gentile.

- `Exploiting random walks for learning.' P. L. Bartlett, P. Fischer and K.-U. Hoeffgen. (COLT 1994.)

- `A Stochastic View of Optimal Regret through Minimax Duality.' J. Abernethy, A. Agarwal, P. L. Bartlett and A. Rakhlin.

- `Online Learning: Random Averages, Combinatorial Parameters, and Learnability.' A. Rakhlin, K. Sridharan and A. Tewari.

The following two survey papers give nice overviews of kernel methods.

(Note that Section 2 in both papers provides some worthwhile intuition, but the theorems are only superficially related to kernel methods.)

- `An introduction to kernel-based learning algorithms.' K.-R. Mueller, S. Mika, G. Raetsch, K. Tsuda, and B. Schoelkopf.
- `A Tutorial on Support Vector Machines for Pattern Recognition.' C. J. C. Burges.

- `A training algorithm for optimal margin classifiers.' B. Boser, I. Guyon, and V. Vapnik.
- `Support-Vector Networks.' C. Cortes and V. Vapnik.

See also the textbooks:

- `An Introduction to Support Vector Machines.' N. Cristianini and J. Shawe-Taylor. Cambridge University Press, Cambridge, UK, 2000.
- `Kernel Methods for Pattern Analysis.' J. Shawe-Taylor and N. Cristianini. Cambridge University Press, Cambridge, UK, 2004.
- `Learning with Kernels.' B. Schoelkopf and A. Smola. MIT Press, Cambridge, MA, 2002.
- `Support Vector Machines.' I. Steinwart and A. Christmann. Springer, 2008.

- `Convex Optimization.' S. Boyd and L. Vandenberghe.

- `Some Results on Tchebycheffian Spline Functions' G. Kimeldorf and G. Wahba.

- `Pegasos: Primal Estimated sub-GrAdient SOlver for SVM.' S. Shalev-Shwartz, Y. Singer, N. Srebro and A. Cotter.

- `Convexity, classification, and risk bounds.' Peter Bartlett, Mike Jordan and Jon McAuliffe.
- `Statistical behavior and consistency of classification methods based on convex risk minimization.' Tong Zhang.
- `How to compare different loss functions and their risks.' Ingo Steinwart.

- `A decision-theoretic generalization of on-line learning and an application to boosting.' Yoav Freund and Robert E. Schapire.
- `Experiments with a new boosting algorithm.' Yoav Freund and Robert E. Schapire.

- `Boosting the margin: A new explanation for the effectiveness of voting methods.' Robert E. Schapire, Yoav Freund, Peter Bartlett and Wee Sun Lee.

- `The boosting approach to machine learning: An overview.' Robert E. Schapire.
- `Boosting: Foundations and Algorithms.' Robert E. Schapire and Yoav Freund. MIT Press. 2012.

- `Arcing classifiers.' Leo Breiman.
- `Additive logistic regression: a statistical view of boosting.' Jerome Friedman, Trevor Hastie and Robert Tibshirani.

- `Improved boosting algorithms using confidence-rated predictions' R. E. Schapire and Y. Singer.

- `Logistic Regression, AdaBoost and Bregman Distances.' M. Collins, R. E. Schapire and Y. Singer.

- `Model selection and error estimation' Peter L. Bartlett, Stéphane Boucheron and Gábor Lugosi.

- `Regression Methods for Pattern Classification: Statistical Properties of Large Margin Classifiers.' Peter Bartlett.

- `AdaBoost is Consistent.' Peter L. Bartlett and Mikhail Traskin.
- `The Rate of Convergence of AdaBoost.' Indraneel Mukherjee, Cynthia Rudin and Robert E. Schapire.

- `Some Theory for Generalized Boosting Algorithms.' Peter J. Bickel, Ya’acov Ritov, and Alon Zakai.

- `Random Classification Noise Defeats All Convex Potential Boosters.' Philip M. Long and Rocco A. Servedio.

- `Boosting with the Logistic Loss is Consistent.' Matus Telgarsky.

See also the readings in this previous incarnation of the course.
