CS 281B / Stat 241B, Spring 2014:
Statistical Learning Theory
Readings
This page contains pointers to a collection of optional readings,
in case you wish to delve further into the topics covered in lectures.
Probabilistic and game-theoretic formulation of prediction problems
The following text books describe probabilistic formulations of
prediction problems:
`A Probabilistic Theory of Pattern Recognition,'
L. Devroye, L. Gyorfi, G. Lugosi,
Springer, New York, 1996.
`Statistical Learning Theory,'
Vladimir N. Vapnik,
Wiley, 1998.
`Neural Network Learning: Theoretical Foundations,'
Martin Anthony and Peter L. Bartlett,
Cambridge University Press, 1999.
`An Elementary Introduction to Statistical Learning Theory,'
Sanjeev Kulkarni and Gilbert Harman,
Wiley, 2011.
`Foundations of Machine Learning,'
Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar,
MIT Press, 2012.
This text book describes game-theoretic formulations of prediction
problems:
`Prediction, Learning, and Games.'
N. Cesa-Bianchi and G. Lugosi,
Cambridge University Press, 2006.
See also the following review papers.
`Theory of Classification: a Survey of Some Recent Advances.'
Stephane Boucheron, Olivier Bousquet and Gabor Lugosi.
`Online Learning and Online Convex Optimization.'
Shai Shalev-Shwartz.
Perceptron Algorithm
The argument giving the minimax lower bound for linear
threshold functions is similar to the proof of the main
result in the following paper.
`A General Lower Bound on the Number of Examples Needed for Learning.'
A. Ehrenfeucht, D. Haussler, M. Kearns and L. Valiant.
The following are revised editions (1987 and 1990) of older books
(originally published in 1969 and 1965, respectively)
on linear threshold functions, the perceptron algorithm, and the
perceptron convergence theorem.
`Perceptrons: An Introduction to Computational Geometry.'
Marvin L. Minsky and Seymour A. Papert.
MIT Press, 1987.
`The Mathematical Foundations of Learning Machines.'
Nils J. Nilsson.
Morgan Kaufmann, San Francisco, 1990.
The upper bound on risk for the perceptron algorithm that we saw
in lectures follows from
the perceptron convergence theorem and results converting mistake-bounded
algorithms into algorithms with bounded average risk.
The following paper reviews these results.
`Large margin classification using the perceptron algorithm.'
Yoav Freund and Robert E. Schapire.
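To make the mistake-bound discussion concrete, here is a minimal sketch of the perceptron algorithm (an illustrative implementation, not code from any of the papers above):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Classic perceptron: on each mistake, add y_t * x_t to the weights.

    X: (n, d) array of examples; y: array of labels in {-1, +1}.
    Returns the weight vector and the total number of mistakes, which the
    perceptron convergence theorem bounds by (R / gamma)**2 when the data
    are separable with margin gamma and have norms bounded by R.
    """
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(max_epochs):
        made_mistake = False
        for x_t, y_t in zip(X, y):
            if y_t * np.dot(w, x_t) <= 0:   # mistake (or on the boundary)
                w += y_t * x_t
                mistakes += 1
                made_mistake = True
        if not made_mistake:                # no mistakes in a full pass: done
            break
    return w, mistakes
```

On separable data the loop terminates after finitely many mistakes, which is exactly the quantity the convergence theorem controls.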
Risk Bounds and Uniform Convergence
The following books give more detail on concentration inequalities.
Metric Characterization of Random Variables and Random Processes.
V. Buldygin and I. Kozachenko. American Mathematical Society, 2000.
Concentration Inequalities: A Nonasymptotic Theory of Independence.
S. Boucheron, G. Lugosi and P. Massart.
Oxford University Press, 2013.
See also this paper.
`Concentration inequalities.'
S. Boucheron, O. Bousquet and G. Lugosi.
The Hoeffding-Azuma inequality:
`Weighted sums of certain dependent random variables.'
Kazuoki Azuma. 1967.
`Probability inequalities for sums of bounded random variables.'
Wassily Hoeffding. 1963.
`On the method of bounded differences.'
Colin McDiarmid. 1989.
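For reference, the Hoeffding-Azuma inequality cited above can be stated as follows (one common form):

```latex
% Hoeffding-Azuma inequality: if X_1, \dots, X_n is a martingale
% difference sequence with |X_i| \le c_i almost surely, then for t > 0,
\[
  \Pr\!\left( \sum_{i=1}^{n} X_i \ge t \right)
  \;\le\; \exp\!\left( - \frac{t^2}{2 \sum_{i=1}^{n} c_i^2} \right).
\]
```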
Rademacher averages
`A few notes on Statistical Learning Theory.'
Shahar Mendelson.
`Rademacher and Gaussian complexities:
risk bounds and structural results'
P. L. Bartlett and S. Mendelson.
`Model selection and error estimation.'
P. Bartlett, S. Boucheron and G. Lugosi.
`Rademacher penalties and structural risk minimization.'
Vladimir Koltchinskii.
Rademacher averages for large margin classifiers:
`Empirical margin distributions and bounding the generalization error
of combined classifiers.'
Vladimir Koltchinskii and Dmitriy Panchenko.
The `finite lemma' (Rademacher averages of finite sets) is Lemma 5.2
in this paper, which also introduces local Rademacher averages.
`Some applications of concentration inequalities to statistics.'
Pascal Massart.
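For orientation, the finite lemma gives, for a finite class F of functions taking values in [-1, 1], a bound of the following form on the Rademacher average:

```latex
% Finite (Massart's) lemma: for a finite class F with |f(x_i)| \le 1,
% and independent Rademacher variables \epsilon_1, \dots, \epsilon_n,
\[
  \mathbb{E}\,\max_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \epsilon_i f(x_i)
  \;\le\; \sqrt{\frac{2 \ln |F|}{n}}.
\]
```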
The contraction inequality in Lecture 8 is Corollary 3.17 in this
book.
Probability in Banach Spaces: Isoperimetry and Processes.
M. Ledoux and M. Talagrand. Springer, 1991.
The growth function, VC-dimension, and pseudodimension
are described in the following text (see chapter 3).
Estimates of these quantities for parameterized
function classes are covered in chapters 7 and 8.
`Neural Network Learning: Theoretical Foundations.'
Martin Anthony and Peter L. Bartlett. Cambridge University Press, 1999.
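As a reminder of how these quantities relate, Sauer's lemma bounds the growth function of a class F with VC-dimension d:

```latex
% Sauer's lemma: the growth function of a class of VC-dimension d satisfies
\[
  \Pi_F(n) \;\le\; \sum_{i=0}^{d} \binom{n}{i}
  \;\le\; \left( \frac{en}{d} \right)^{d} \qquad (n \ge d).
\]
```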
Online learning
Early papers on prediction of individual sequences:
See also the text book:
- `Prediction, learning, and games.'
N. Cesa-Bianchi and G. Lugosi.
Cambridge University Press, 2006.
The minimax algorithm for dot loss was described in:
Follow the perturbed leader was described in the following papers:
Tracking and adaptive regret are covered in:
Normalized Maximum Likelihood:
- `Universal sequential coding of single messages.'
Yuri M. Shtarkov.
(Problems of Information Transmission, vol. 23, no. 3, pp. 175-186, 1987.)
- Section 9.4 of
`Prediction, learning, and games.'
N. Cesa-Bianchi and G. Lugosi.
Cambridge University Press, 2006.
Universal portfolios:
- `Universal portfolios.'
T. M. Cover.
(Mathematical Finance, vol. 1, no. 1, pp. 1-29, 1991.)
- Chapter 10 of
`Prediction, learning, and games.'
N. Cesa-Bianchi and G. Lugosi.
Cambridge University Press, 2006.
- `Algorithms for Portfolio Management based on the Newton Method.'
Amit Agarwal, Elad Hazan, Satyen Kale and Robert E. Schapire.
(Proceedings of the 23rd ICML, 2006.)
Context Tree Weighting:
Mixable losses:
Specialists:
Online convex optimization
The treatment in lectures draws heavily on Chapter 11 of
`Prediction, learning, and games.'
N. Cesa-Bianchi and G. Lugosi.
Cambridge University Press, 2006.
Lecture notes from an earlier version of this course:
Survey chapter on online convex optimization:
The following paper introduced the online convex optimization
formulation.
The following paper gives logarithmic regret for strongly convex functions.
The following paper gives
some results on the relationship between prediction in adversarial and
probabilistic settings.
The simple online-to-batch conversion described in the lecture is
based on Theorem 5 from this paper.
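A standard averaging form of online-to-batch conversion (a generic sketch, not necessarily the exact construction in the paper; `online_update` is a placeholder for any online algorithm's update rule) looks like:

```python
import numpy as np

def online_to_batch(online_update, data, w0):
    """Generic online-to-batch conversion by averaging.

    Runs the online algorithm once over the sample and returns the average
    of the iterates w_1, ..., w_n (where w_t is the hypothesis used to
    predict on example t).  For convex losses, Jensen's inequality converts
    the regret bound into a bound on the risk of the averaged iterate.

    online_update(w, z) -> next iterate after seeing example z.
    """
    w = np.asarray(w0, dtype=float)
    iterates = [w.copy()]
    for z in data:
        w = online_update(w, z)
        iterates.append(w.copy())
    return np.mean(iterates[:-1], axis=0)   # average of w_1, ..., w_n
```

For example, plugging in a gradient step for squared loss drives the averaged iterate toward the risk minimizer.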
The following paper gives optimal regret rates and the formulation of the
dual game, plus the proof of
the upper bound in terms of sequential Rademacher averages.
See also this paper, which has the lower bound for online
learning with absolute loss in terms of
sequential Rademacher averages.
Kernel Methods, Support Vector Machines
The following two survey papers give nice overviews of
kernel methods.
(Note that Section 2 in both papers provides some worthwhile intuition,
but the theorems are only superficially related to kernel methods.)
The following papers introduce the hard margin and soft margin SVMs.
See also the text books:
- `An Introduction to Support Vector Machines.'
N. Cristianini and J. Shawe-Taylor.
Cambridge University Press, Cambridge, UK, 2000.
- `Kernel Methods for Pattern Analysis.'
J. Shawe-Taylor and N. Cristianini.
Cambridge University Press, Cambridge, UK, 2004.
- `Learning with Kernels.'
B. Schoelkopf and A. Smola.
MIT Press, Cambridge, MA, 2002.
- `Support Vector Machines.'
I. Steinwart and A. Christmann.
Springer, 2008.
The following text book gives a good treatment of
constrained optimization problems and Lagrangian duality
(see Chapter 5). It is available on the web.
The original representer theorem.
The use of online convex optimization methods (stochastic gradient descent)
for fast approximate computation of the SVM QP was first investigated in this paper:
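To illustrate the flavor of such methods, here is a hedged sketch of stochastic subgradient descent on the regularized hinge loss; it is in the spirit of Pegasos-style algorithms, not the exact method from the paper:

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, epochs=10, seed=0):
    """Stochastic subgradient descent on the SVM objective
    lam/2 * ||w||^2 + (1/n) * sum_i max(0, 1 - y_i <w, x_i>).

    A rough sketch: one example at a time, step size 1/(lam * t).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)            # standard 1/(lam t) step size
            if y[i] * np.dot(w, X[i]) < 1:   # margin violated: hinge subgradient
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:                            # only the regularizer contributes
                w = (1 - eta * lam) * w
    return w
```

Each step costs O(d), so the total cost scales with the number of examples processed rather than with solving the full QP.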
The following papers present relationships between convex cost
functions and discrete loss (for two-class pattern classification).
The first paper generalizes and simplifies results of the second.
The third paper considers more general decision-theoretic problems,
including weighted classification, regression, quantile estimation
and density estimation.
Ensemble Methods
Early boosting papers.
This paper contains the
result in lectures about the relationship between the existence of
weak learners and the existence of a large margin convex combination.
It also contains bounds on the misclassification probability of a
large margin classifier.
Some surveys of boosting.
Two other views of boosting algorithms.
An extension of AdaBoost to real-valued base classifiers.
The analysis in lectures of AdaBoost as an entropy projection method
follows Chapter 8 of Schapire and Freund's text, which in turn is based
on this paper, and references:
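As a point of reference for the algorithm being analyzed, a minimal AdaBoost loop over a finite set of base classifiers might look like this (an illustrative sketch; `H` is any list of base classifiers with values in {-1, +1}):

```python
import numpy as np

def adaboost(H, X, y, T=20):
    """Minimal AdaBoost over a finite list H of base classifiers h(x) -> +-1.

    Maintains a distribution D over examples, repeatedly picks the base
    classifier with smallest weighted error, and reweights the examples
    exponentially.  Returns the coefficients alpha of the final
    linear combination sum_j alpha[j] * h_j.
    """
    n = len(y)
    y = np.asarray(y)
    preds = np.array([[h(x) for x in X] for h in H])  # |H| x n predictions
    D = np.full(n, 1.0 / n)
    alpha = np.zeros(len(H))
    for _ in range(T):
        errs = (preds != y) @ D               # weighted error of each h_j
        j = int(np.argmin(errs))
        eps = max(errs[j], 1e-12)             # avoid log(0) for a perfect h_j
        a = 0.5 * np.log((1.0 - eps) / eps)   # AdaBoost step size
        alpha[j] += a
        D = D * np.exp(-a * y * preds[j])     # upweight the mistakes
        D = D / D.sum()
    return alpha

def adaboost_predict(H, alpha, X):
    preds = np.array([[h(x) for x in X] for h in H])
    return np.sign(alpha @ preds)
```

The exponential reweighting step is exactly where the entropy-projection view enters: each round is a projection of D onto a constraint set in relative entropy.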
The oracle inequality for complexity-regularized model selection
is from this paper:
This talk has some more on oracle inequalities and universal consistency:
The universal consistency analysis in lectures for AdaBoost
follows Chapter 12 of Schapire and Freund's text, which is based on
these papers:
The latter paper improved the algorithmic convergence rate bounds for
AdaBoost over the bound from the following paper, which was the result
used in the first proof of universal consistency above.
The following paper shows that the universal approximation assumption
made in that analysis is essential: with a class of base classifiers that
is too simple, a broad family of methods based on convex surrogate loss
minimization (including AdaBoost) will fail spectacularly on some simple
noisy problems.
The following paper extends the universal consistency result
to the logistic loss.
See also the readings in this
previous incarnation of the course.