CS 281B / Stat 241B, Spring 2014:

Statistical Learning Theory


This page contains pointers to a collection of optional readings, in case you wish to delve further into the topics covered in lectures.

Probabilistic and game-theoretic formulation of prediction problems

The following textbooks describe probabilistic formulations of prediction problems:
`A Probabilistic Theory of Pattern Recognition,' L. Devroye, L. Gyorfi, G. Lugosi, Springer, New York, 1996.
`Statistical Learning Theory,' Vladimir N. Vapnik, Wiley, 1998.
`Neural Network Learning: Theoretical Foundations,' Martin Anthony and Peter L. Bartlett, Cambridge University Press, 1999.
`An Elementary Introduction to Statistical Learning Theory,' Sanjeev Kulkarni and Gilbert Harman, Wiley, 2011.
`Foundations of Machine Learning,' Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar, MIT Press, 2012.

This textbook describes game-theoretic formulations of prediction problems:
`Prediction, Learning, and Games.' N. Cesa-Bianchi and G. Lugosi, Cambridge University Press, 2006.

See also the following review papers.
`Theory of Classification: a Survey of Recent Advances.' Stephane Boucheron, Olivier Bousquet and Gabor Lugosi.
`Online learning and online convex optimization.' Shai Shalev-Shwartz.

Perceptron Algorithm

The argument giving the minimax lower bound for linear threshold functions is similar to the proof of the main result in the following paper.
`A General Lower Bound on the Number of Examples Needed for Learning.' A. Ehrenfeucht, D. Haussler, M. Kearns and L. Valiant.

The following are old (1987 and 1990) revisions of older (1969 and 1965, respectively) books on linear threshold functions, the perceptron algorithm, and the perceptron convergence theorem.
Perceptrons: An Introduction to Computational Geometry. Marvin L. Minsky and Seymour A. Papert. MIT Press, 1987.
The Mathematical Foundations of Learning Machines. N. Nilsson. Morgan Kaufmann, San Francisco, 1990.

The upper bound on risk for the perceptron algorithm that we saw in lectures follows from the perceptron convergence theorem, together with results that convert mistake bounds for online algorithms into bounds on average risk. The following paper reviews these results.
Large margin classification using the perceptron algorithm. Yoav Freund and Robert E. Schapire.
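
As a reminder of what the perceptron convergence theorem says, here is a minimal sketch in Python (not taken from any of the readings). It runs the perceptron on separable data and compares the number of mistakes with the classical bound (R / gamma)^2, where R is the largest norm of an example and gamma is the margin achieved by a unit-norm separator. The toy data, the separator `w_star`, the margin threshold 0.1, and all function names are assumptions made purely for illustration.

```python
import math
import random


def perceptron(points, labels, max_passes=1000):
    """Cycle through the data, updating on every mistake.

    Returns the final weight vector and the total number of mistakes.
    """
    w = [0.0] * len(points[0])
    mistakes = 0
    for _ in range(max_passes):
        clean_pass = True
        for x, y in zip(points, labels):
            # A mistake is a non-positive margin y * <w, x>.
            if y * sum(wj * xj for wj, xj in zip(w, x)) <= 0:
                w = [wj + y * xj for wj, xj in zip(w, x)]
                mistakes += 1
                clean_pass = False
        if clean_pass:
            break
    return w, mistakes


# Hypothetical toy data: points in [-1, 1]^2, labeled by a unit-norm
# separator w_star, keeping only points with margin at least 0.1.
random.seed(0)
w_star = (1 / math.sqrt(2), -1 / math.sqrt(2))
points, labels = [], []
while len(points) < 100:
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    margin = w_star[0] * x[0] + w_star[1] * x[1]
    if abs(margin) >= 0.1:
        points.append(x)
        labels.append(1 if margin > 0 else -1)

w, mistakes = perceptron(points, labels)
radius = max(math.hypot(*x) for x in points)
gamma = min(y * (w_star[0] * x[0] + w_star[1] * x[1])
            for x, y in zip(points, labels))
bound = (radius / gamma) ** 2
print(mistakes, "mistakes; bound", bound)
```

The mistake count never exceeds the bound, whatever order the examples are presented in; the conversion results reviewed in the paper above turn such a bound into a bound on average risk.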

Risk Bounds and Uniform Convergence

The following books give more detail on concentration inequalities.
Metric Characterization of Random Variables and Random Processes. V. Buldygin and I. Kozachenko. American Mathematical Society, 2000.
Concentration Inequalities: A Nonasymptotic Theory of Independence. S. Boucheron, G. Lugosi and P. Massart. Oxford University Press, 2013.
See also this paper.
`Concentration inequalities.' S. Boucheron, O. Bousquet and G. Lugosi.

The Hoeffding-Azuma inequality:
`Weighted sums of certain dependent random variables.' Kazuoki Azuma. 1967.
`Probability inequalities for sums of bounded random variables.' Wassily Hoeffding. 1963.
`On the method of bounded differences.' Colin McDiarmid. 1989.

Rademacher averages:
`A few notes on Statistical Learning Theory.' Shahar Mendelson.
`Rademacher and Gaussian complexities: risk bounds and structural results.' P. L. Bartlett and S. Mendelson.
`Model selection and error estimation.' P. Bartlett, S. Boucheron and G. Lugosi.
`Rademacher penalties and structural risk minimization.' Vladimir Koltchinskii.

Rademacher averages for large margin classifiers:
`Empirical margin distributions and bounding the generalization error of combined classifiers.' Vladimir Koltchinskii and Dmitriy Panchenko.

The `finite lemma' (Rademacher averages of finite sets) is Lemma 5.2 in this paper, which also introduces local Rademacher averages.
`Some applications of concentration inequalities to statistics.' Pascal Massart.
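
To make the finite lemma concrete, the following Python sketch computes the Rademacher average of a small finite set A of vectors in R^n exactly, by enumerating all 2^n sign vectors, and compares it with the bound max_a ||a||_2 sqrt(2 log |A|) / n. This is the commonly quoted form of the lemma; the normalization in the paper may differ. The set used here (threshold labelings of n = 8 points) and the function names are assumptions chosen for illustration.

```python
import itertools
import math


def rademacher_average(vectors):
    """Exact value of E max_{a in A} (1/n) sum_i eps_i a_i.

    The expectation is over uniform signs eps in {-1, +1}^n and is
    computed by enumerating all 2^n sign vectors (small n only).
    """
    n = len(vectors[0])
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=n):
        total += max(sum(s * a for s, a in zip(signs, vec)) for vec in vectors)
    return total / (2 ** n) / n


def finite_lemma_bound(vectors):
    """max_a ||a||_2 * sqrt(2 log |A|) / n."""
    n = len(vectors[0])
    max_norm = max(math.sqrt(sum(a * a for a in vec)) for vec in vectors)
    return max_norm * math.sqrt(2 * math.log(len(vectors))) / n


# Hypothetical finite set: the n + 1 labelings of n ordered points
# produced by threshold classifiers with outputs in {-1, +1}.
n = 8
vectors = [tuple(1 if i >= t else -1 for i in range(n)) for t in range(n + 1)]

exact = rademacher_average(vectors)
bound = finite_lemma_bound(vectors)
print(exact, "<=", bound)
```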

The contraction inequality in Lecture 8 is Corollary 3.17 in this book.
Probability in Banach Spaces: Isoperimetry and Processes. M. Ledoux and M. Talagrand. Springer, 1991.

The growth function, VC-dimension, and pseudodimension are described in the following text (see chapter 3). Estimates of these quantities for parameterized function classes are covered in chapters 7 and 8.
`Neural network learning: Theoretical foundations.' Martin Anthony and Peter Bartlett. Cambridge University Press. 1999.
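
As a small illustration of these definitions, the sketch below counts, by brute force, the labelings that one-dimensional threshold classifiers x -> 1[x >= t] induce on a set of distinct points, and checks whether a set is shattered. The choice of class, the sample points, and the function names are assumptions for illustration; they are not taken from the text.

```python
def threshold_patterns(points):
    """All labelings of distinct `points` by x -> 1[x >= t], t real.

    It suffices to try one threshold below all points, one between each
    pair of adjacent points, and one above all points.
    """
    cuts = sorted(points)
    candidates = (
        [cuts[0] - 1.0]
        + [(a + b) / 2 for a, b in zip(cuts, cuts[1:])]
        + [cuts[-1] + 1.0]
    )
    return {tuple(int(x >= t) for x in points) for t in candidates}


def is_shattered(points):
    """True if the thresholds realize all 2^n labelings of the points."""
    return len(threshold_patterns(points)) == 2 ** len(points)


sample = [0.3, 1.2, 2.5, 4.0]
print(len(threshold_patterns(sample)))  # n + 1 labelings of n points
print(is_shattered([0.3]), is_shattered([0.3, 1.2]))
```

On n distinct points this class realizes n + 1 labelings, so its growth function is n + 1: a single point is shattered but no pair is, which gives VC-dimension 1.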

Online learning

Early papers on prediction of individual sequences:

See also the textbook:

The minimax algorithm for dot loss was described in:

Follow the perturbed leader was described in the following papers:

Tracking and adaptive regret are covered in:

Normalized Maximum Likelihood:

Universal portfolios:

Context Tree Weighting:

Mixable losses:

Specialists:

Online learning

The treatment in lectures draws heavily on Chapter 11 of `Prediction, learning, and games.' N. Cesa-Bianchi and G. Lugosi. Cambridge University Press, 2006.

Lecture notes from an earlier version of this course:

Survey chapter on online convex optimization:

The following paper introduced the online convex optimization formulation.

Logarithmic regret for strongly convex functions.

The following paper gives some results on the relationship between prediction in adversarial and probabilistic settings. The simple online-to-batch conversion described in the lecture is based on Theorem 5 from this paper.

Optimal regret and the formulation of the dual game, plus the proof of the upper bound in terms of sequential Rademacher averages.

See also this paper, which has the lower bound for online learning with absolute loss in terms of sequential Rademacher averages.

Kernel Methods, Support Vector Machines

The following two survey papers give nice overviews of kernel methods.
(Note that Section 2 in both papers provides some worthwhile intuition, but the theorems are only superficially related to kernel methods.)

The following papers introduce the hard margin and soft margin SVMs.

See also the textbooks:

The following textbook gives a good treatment of constrained optimization problems and Lagrangian duality (see Chapter 5). It is available on the web.

The original representer theorem.

The use of online convex optimization methods (stochastic gradient descent) for fast approximate computation of the SVM QP was first investigated in this paper:

The following papers present relationships between convex cost functions and discrete loss (for two-class pattern classification). The first paper generalizes and simplifies results of the second. The third paper considers more general decision-theoretic problems, including weighted classification, regression, quantile estimation and density estimation.

Ensemble Methods

Early boosting papers.

This paper contains the result in lectures about the relationship between the existence of weak learners and the existence of a large margin convex combination. It also contains bounds on the misclassification probability of a large margin classifier.

Some surveys of boosting.

Two other views of boosting algorithms.

An extension of AdaBoost to real-valued base classifiers.

The analysis in lectures of AdaBoost as an entropy projection method follows Chapter 8 of Schapire and Freund's text, which in turn is based on this paper, and references:

The oracle inequality for complexity-regularized model selection is from this paper:

This talk has some more on oracle inequalities and universal consistency:

The universal consistency analysis in lectures for AdaBoost follows Chapter 12 of Schapire and Freund's text, which is based on these papers:

The latter paper improved the algorithmic convergence rate bounds for AdaBoost over the bound from the following paper, which was the result used in the first proof of universal consistency above.

The following paper shows that the universal approximation assumption made in that analysis is essential: with a class of base classifiers that is too simple, a broad family of methods based on convex surrogate loss minimization (including AdaBoost) will fail spectacularly on some simple noisy problems.

The following paper extends the universal consistency result to the logistic loss.

See also the readings in this previous incarnation of the course.
