This page contains pointers to a collection of optional readings, in case you wish to delve further into the topics covered in lectures.

`A Probabilistic Theory of Pattern Recognition.' L. Devroye, L. Gyorfi and G. Lugosi. Springer, New York, 1996.

`Statistical Learning Theory.' Vladimir N. Vapnik. Wiley, 1998.

`Neural Network Learning: Theoretical Foundations.' Martin Anthony and Peter L. Bartlett. Cambridge University Press, 1999.

`An Elementary Introduction to Statistical Learning Theory.' Sanjeev Kulkarni and Gilbert Harman. Wiley, 2011.

`Foundations of Machine Learning.' Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar. MIT Press, 2012.

This textbook describes game-theoretic formulations of prediction problems:

`Prediction, Learning, and Games.' N. Cesa-Bianchi and G. Lugosi, Cambridge University Press, 2006.

See also the following review papers.

`Theory of Classification: a Survey of Some Recent Advances.' Stephane Boucheron, Olivier Bousquet and Gabor Lugosi.

`Online learning and online convex optimization.' Shai Shalev-Shwartz.

`A General Lower Bound on the Number of Examples Needed for Learning.' A. Ehrenfeucht, D. Haussler, M. Kearns and L. Valiant.

The following are 1987 and 1990 revisions of older books (first published in 1969 and 1965, respectively) on linear threshold functions, the perceptron algorithm, and the perceptron convergence theorem.

`Perceptrons: An Introduction to Computational Geometry.' Marvin L. Minsky and Seymour A. Papert. MIT Press, 1987.

`The Mathematical Foundations of Learning Machines.' N. Nilsson. Morgan Kaufmann, San Francisco, 1990.

The upper bound on risk for the perceptron algorithm that we saw in lectures follows from the perceptron convergence theorem, together with results converting mistake-bounded algorithms to average risk bounds. The following paper reviews these results.

`Large margin classification using the perceptron algorithm.' Yoav Freund and Robert E. Schapire.
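As a quick reminder of the algorithm these bounds apply to, here is a minimal sketch of the perceptron in plain Python (the function and variable names are illustrative, not taken from any of the texts above):

```python
def perceptron(X, y, max_passes=100):
    """Run the perceptron on examples X (lists of floats) with labels y in {-1, +1}.

    Returns the final weight vector and the total number of mistakes.
    When the data are linearly separable with margin gamma inside a ball
    of radius R, the convergence theorem bounds the mistakes by (R/gamma)**2.
    """
    d = len(X[0])
    w = [0.0] * d
    mistakes = 0
    for _ in range(max_passes):
        clean_pass = True
        for xi, yi in zip(X, y):
            if yi * sum(wj * xj for wj, xj in zip(w, xi)) <= 0:
                # Mistake: move the weights towards the misclassified example.
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                mistakes += 1
                clean_pass = False
        if clean_pass:  # A full pass with no mistakes: a separator has been found.
            break
    return w, mistakes
```

On separable data the loop terminates after a clean pass; on non-separable data it simply stops after `max_passes` passes.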

Metric Characterization of Random Variables and Random Processes. V. Buldygin and I. Kozachenko. American Mathematical Society, 2000.

Concentration Inequalities: A Nonasymptotic Theory of Independence. S. Boucheron, G. Lugosi and P. Massart. Oxford University Press, 2013.

See also this paper.

`Concentration inequalities.' S. Boucheron, O. Bousquet and G. Lugosi.

The Hoeffding-Azuma inequality:

`Weighted sums of certain dependent random variables.' Kazuoki Azuma. 1967.

`Probability inequalities for sums of bounded random variables.' Wassily Hoeffding. 1963.

`On the method of bounded differences.' Colin McDiarmid. 1989.
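For orientation, a commonly used form of the Hoeffding-Azuma inequality is the following (stated informally here; see the cited papers for the precise conditions and constants). If $V_1, \dots, V_n$ is a martingale difference sequence with $|V_i| \le c_i$ almost surely, then

```latex
\[
  \Pr\!\left( \sum_{i=1}^{n} V_i \ge t \right)
  \;\le\;
  \exp\!\left( \frac{-t^2}{2 \sum_{i=1}^{n} c_i^2} \right)
  \qquad \text{for all } t > 0 .
\]
```

McDiarmid's bounded differences inequality follows by applying this to the Doob martingale of a function with bounded coordinate-wise variation.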

`Rademacher and Gaussian complexities: risk bounds and structural results' P. L. Bartlett and S. Mendelson.

`Model selection and error estimation.' P. Bartlett, S. Boucheron and G. Lugosi.

`Rademacher penalties and structural risk minimization.' Vladimir Koltchinskii.

Rademacher averages for large margin classifiers:

`Empirical margin distributions and bounding the generalization error of combined classifiers.' Vladimir Koltchinskii and Dmitriy Panchenko.

The `finite lemma' (Rademacher averages of finite sets) is Lemma 5.2 in this paper, which also introduces local Rademacher averages.

`Some applications of concentration inequalities to statistics.' Pascal Massart.
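The finite lemma is often stated as follows (informally reproduced here; the cited paper gives the precise version). For a finite set $A \subset \mathbb{R}^n$ and independent Rademacher variables $\sigma_1, \dots, \sigma_n$,

```latex
\[
  \mathbb{E}\,\sup_{a \in A} \sum_{i=1}^{n} \sigma_i a_i
  \;\le\;
  \max_{a \in A} \|a\|_2 \,\sqrt{2 \ln |A|} .
\]
```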

The contraction inequality in Lecture 8 is Corollary 3.17 in this book.

Probability in Banach Spaces: Isoperimetry and Processes. M. Ledoux and M. Talagrand. Springer, 1991.
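A commonly used consequence of the contraction inequality, as typically stated in the learning-theory literature (the book's statement is more general), is: for $T \subset \mathbb{R}^n$ and $1$-Lipschitz functions $\varphi_i : \mathbb{R} \to \mathbb{R}$,

```latex
\[
  \mathbb{E}\,\sup_{t \in T} \sum_{i=1}^{n} \sigma_i \varphi_i(t_i)
  \;\le\;
  \mathbb{E}\,\sup_{t \in T} \sum_{i=1}^{n} \sigma_i t_i ,
\]
```

where the $\sigma_i$ are independent Rademacher variables. In particular, composing a function class with a Lipschitz loss does not increase its Rademacher complexity by more than the Lipschitz constant.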

The growth function, VC-dimension, and pseudodimension are described in the following text (see Chapter 3). Estimates of these quantities for parameterized function classes are covered in Chapters 7 and 8.

`Neural Network Learning: Theoretical Foundations.' Martin Anthony and Peter Bartlett. Cambridge University Press, 1999.

- `Aggregating strategies.' V. Vovk.
- `How to use expert advice.' N. Cesa-Bianchi, Y. Freund, D.P. Helmbold, D. Haussler, R. Schapire, and M.K. Warmuth.
- `On prediction of individual sequences.' N. Cesa-Bianchi and G. Lugosi.

- `Prediction, learning, and games.' N. Cesa-Bianchi and G. Lugosi. Cambridge University Press, 2006.

- `When Random Play is Optimal Against an Adversary.' Jacob Abernethy, Manfred K. Warmuth and Joel Yellin. (COLT 2008.)

- `Efficient Algorithms for On-line Optimization.' Adam Tauman Kalai and Santosh Vempala. (Journal of Computer and System Sciences 71(3): 291-307, 2005.)
- `Adaptive Online Prediction by Following the Perturbed Leader.' Marcus Hutter and Jan Poland. (Journal of Machine Learning Research 6: 639-660, 2005.)

- `Tracking the best expert.' Mark Herbster and Manfred K. Warmuth (Machine Learning 32(2), 151-178, 1998.)
- `Efficient learning algorithms for changing environments.' Elad Hazan and C. Seshadhri (ICML p. 50, 2009.)
- `A closer look at adaptive regret.' Dmitry Adamskiy, Wouter M. Koolen, Alexey Chernov, and Vladimir Vovk. (ALT 290-304, 2012.)

- `Universal sequential coding of single messages.' Yuri M. Shtarkov (Problems of Information Transmission, vol. 23, no. 3, pp. 175-186, 1987)
- Section 9.4 of `Prediction, learning, and games.' N. Cesa-Bianchi and G. Lugosi. Cambridge University Press, 2006.

- `Universal portfolios.' T. M. Cover (Mathematical Finance, vol. 1, no. 1, pp. 1-29, 1991.)
- Chapter 10 of `Prediction, learning, and games.' N. Cesa-Bianchi and G. Lugosi. Cambridge University Press, 2006.
- `Algorithms for Portfolio Management based on the Newton Method.' Amit Agarwal and Elad Hazan and Satyen Kale and Robert E. Schapire. (Proceedings of the 23rd ICML, 2006.)

- `The context-tree weighting method: Basic properties.' Frans M. J. Willems, Yuri M. Shtarkov, and Tjalling J. Tjalkens. (IEEE Transactions on Information Theory, pp 653-664, 1995.)
- The CTW website http://www.ele.tue.nl/ctw/
- `Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition.' Ron Begleiter and Ran El-Yaniv. (Journal of Machine Learning Research 7 pp 379-411, 2006.)

- `Aggregating strategies.' V. Vovk.
- `Mixability is Bayes Risk Curvature Relative to Log Loss.' Tim van Erven, Mark D. Reid and Robert C. Williamson. (Journal of Machine Learning Research, vol. 13, pp. 1639-1663, 2012.)

- `Using and combining predictors that specialize.' Yoav Freund, Robert E. Schapire, Yoram Singer, and Manfred K. Warmuth. (STOC pp 334-343, 1997.)
- `Prediction with expert evaluators' advice.' Alexey Chernov and Vladimir Vovk. (ALT pp 8-22, 2009.)
- `Putting Bayes to sleep.' Wouter M. Koolen, Dmitri Adamskiy, and Manfred K. Warmuth. (NIPS 25, pp 135-143, 2012.)

The treatment in lectures draws heavily on Chapter 11 of `Prediction, learning, and games.' N. Cesa-Bianchi and G. Lugosi. Cambridge University Press, 2006.

Lecture notes from an earlier version of this course:

- Lecture notes on online learning. A. Rakhlin.

- `Online learning and online convex optimization.' S. Shalev-Shwartz.

- `Logarithmic regret algorithms for online convex optimization' E. Hazan, A. Agarwal, and S. Kale.

- `On the generalization ability of on-line learning algorithms.' N. Cesa-Bianchi, A. Conconi, and C. Gentile.

- `Exploiting random walks for learning.' P. L. Bartlett, P. Fischer and K.-U. Hoeffgen. (COLT 1994.)

- `A Stochastic View of Optimal Regret through Minimax Duality.' J. Abernethy, A. Agarwal, P. L. Bartlett and A. Rakhlin.

- `Online Learning: Random Averages, Combinatorial Parameters, and Learnability.' A. Rakhlin, K. Sridharan and A. Tewari.

The following two survey papers give nice overviews of kernel methods.

(Note that Section 2 in both papers provides some worthwhile intuition, but the theorems are only superficially related to kernel methods.)

- `An introduction to kernel-based learning algorithms.' K.-R. Mueller, S. Mika, G. Raetsch, K. Tsuda, and B. Schoelkopf.
- `A Tutorial on Support Vector Machines for Pattern Recognition.' C. J. C. Burges.

- `A training algorithm for optimal margin classifiers.' B. Boser, I. Guyon, and V. Vapnik.
- `Support-Vector Networks.' C. Cortes and V. Vapnik.

See also the textbooks:

- `An Introduction to Support Vector Machines.' N. Cristianini and J. Shawe-Taylor. Cambridge University Press, Cambridge, UK, 2000.
- `Kernel Methods for Pattern Analysis.' J. Shawe-Taylor and N. Cristianini. Cambridge University Press, Cambridge, UK, 2004.
- `Learning with Kernels.' B. Schoelkopf and A. Smola. MIT Press, Cambridge, MA, 2002.
- `Support Vector Machines.' I. Steinwart and A. Christmann. Springer, 2008.

- `Convex Optimization.' S. Boyd and L. Vandenberghe.

- `Some Results on Tchebycheffian Spline Functions' G. Kimeldorf and G. Wahba.

- `Pegasos: Primal Estimated sub-GrAdient SOlver for SVM.' S. Shalev-Shwartz, Y. Singer, N. Srebro and A. Cotter.

- `Convexity, classification, and risk bounds.' Peter Bartlett, Mike Jordan and Jon McAuliffe.
- `Statistical behavior and consistency of classification methods based on convex risk minimization.' Tong Zhang.
- `How to compare different loss functions and their risks.' Ingo Steinwart.

- `A decision-theoretic generalization of on-line learning and an application to boosting.' Yoav Freund and Robert E. Schapire.
- `Experiments with a new boosting algorithm.' Yoav Freund and Robert E. Schapire.

- `Boosting the margin: A new explanation for the effectiveness of voting methods.' Robert E. Schapire, Yoav Freund, Peter Bartlett and Wee Sun Lee.

- `The boosting approach to machine learning: An overview.' Robert E. Schapire.
- `Boosting: Foundations and Algorithms.' Robert E. Schapire and Yoav Freund. MIT Press. 2012.

- `Arcing classifiers.' Leo Breiman.
- `Additive logistic regression: a statistical view of boosting.' Jerome Friedman, Trevor Hastie and Robert Tibshirani.

- `Improved boosting algorithms using confidence-rated predictions' R. E. Schapire and Y. Singer.

- `Logistic Regression, AdaBoost and Bregman Distances.' M. Collins, R. E. Schapire and Y. Singer.

- `Model selection and error estimation' Peter L. Bartlett, Stéphane Boucheron and Gábor Lugosi.

- `Regression Methods for Pattern Classification: Statistical Properties of Large Margin Classifiers.' Peter Bartlett.

- `AdaBoost is Consistent.' Peter L. Bartlett and Mikhail Traskin.
- `The Rate of Convergence of AdaBoost.' Indraneel Mukherjee, Cynthia Rudin and Robert E. Schapire.

- `Some Theory for Generalized Boosting Algorithms.' Peter J. Bickel, Ya’acov Ritov, and Alon Zakai.

- `Random Classification Noise Defeats All Convex Potential Boosters.' Philip M. Long and Rocco A. Servedio.

- `Boosting with the Logistic Loss is Consistent.' Matus Telgarsky.

See also the readings in this previous incarnation of the course.
