[1]  P. L. Bartlett. Lower bounds on the Vapnik-Chervonenkis dimension of multilayer threshold networks. In Proceedings of the Sixth Annual ACM Conference on Computational Learning Theory, pages 144-150. ACM Press, 1993.
[2]  P. L. Bartlett. The sample size necessary for learning in multilayer networks. In Proceedings of the Fourth Australian Conference on Neural Networks, pages 14-17, 1993.
[3]  D. R. Lovell and P. L. Bartlett. Error and variance bounds in multilayer neural networks. In Proceedings of the Fourth Australian Conference on Neural Networks, pages 161-164, 1993.
[4]  P. L. Bartlett. Vapnik-Chervonenkis dimension bounds for two- and three-layer networks. Neural Computation, 5(3):371-373, 1993.
[5]  W. S. Lee, P. L. Bartlett, and R. C. Williamson. The Vapnik-Chervonenkis dimension of neural networks with restricted parameter ranges. In Proceedings of the Fifth Australian Conference on Neural Networks, pages 198-201, 1994.
[6]  P. L. Bartlett. Learning quantized real-valued functions. In Proceedings of Computing: the Australian Theory Seminar, pages 24-35. University of Technology Sydney, 1994.
[7]  W. S. Lee, P. L. Bartlett, and R. C. Williamson. Lower bounds on the VC-dimension of smoothly parametrized function classes. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, pages 362-367. ACM Press, 1994.
[8]  P. L. Bartlett, P. M. Long, and R. C. Williamson. Fat-shattering and the learnability of real-valued functions. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, pages 299-310. ACM Press, 1994.
[9]  P. L. Bartlett, P. Fischer, and K.-U. Höffgen. Exploiting random walks for learning. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, pages 318-327. ACM Press, 1994.
[10]  P. L. Bartlett. Computational learning theory. In A. Kent and J. G. Williams, editors, Encyclopedia of Computer Science and Technology, volume 31, pages 83-99. Marcel Dekker, 1994.
[11]  W. S. Lee, P. L. Bartlett, and R. C. Williamson. Efficient agnostic learning of neural networks with bounded fan-in. In Proceedings of the Sixth Australian Conference on Neural Networks, pages 201-204, 1995.
[12]  P. L. Bartlett and R. C. Williamson. The sample complexity of neural network learning with discrete inputs. In Proceedings of the Sixth Australian Conference on Neural Networks, pages 189-192, 1995.
[13]  M. Anthony and P. L. Bartlett. Function learning from interpolation. In Computational Learning Theory: Second European Conference, EUROCOLT 95, Barcelona Spain, March 1995, Proceedings, pages 211-221, 1995.
[14]  W. S. Lee, P. L. Bartlett, and R. C. Williamson. On efficient agnostic learning of linear combinations of basis functions. In Proceedings of the Eighth Annual ACM Conference on Computational Learning Theory, pages 369-376. ACM Press, 1995.
[15]  P. L. Bartlett and P. M. Long. More theorems about scale sensitive dimensions and learning. In Proceedings of the Eighth Annual ACM Conference on Computational Learning Theory, pages 392-401. ACM Press, 1995.
[16]  P. L. Bartlett and S. Dasgupta. Exponential convergence of a gradient descent algorithm for a class of recurrent neural networks. In Proceedings of the 38th Midwest Symposium on Circuits and Systems, 1995.
[17]  W. S. Lee, P. L. Bartlett, and R. C. Williamson. Lower bounds on the VC-dimension of smoothly parametrized function classes. Neural Computation, 7:990-1002, 1995. (See also correction, Neural Computation, 9:765-769, 1997).
[18]  L. C. Kammer, R. R. Bitmead, and P. L. Bartlett. Adaptive tracking identification: the art of defalsification. In Proceedings of the 1996 IFAC World Congress, 1996.
[19]  W. S. Lee, P. L. Bartlett, and R. C. Williamson. The importance of convexity in learning with squared loss. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 140-146. ACM Press, 1996.
[20]  J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. A framework for structural risk minimization. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 68-76. ACM Press, 1996.
[21]  P. L. Bartlett, S. Ben-David, and S. R. Kulkarni. Learning changing concepts by exploiting the structure of change. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 131-139. ACM Press, 1996.
[22]  L. Kammer, R. R. Bitmead, and P. L. Bartlett. Signal-based testing of LQ-optimality of controllers. In Proceedings of the 35th IEEE Conference on Decision and Control, pages FA17-2, 3620-3623. IEEE, 1996.
[23]  P. L. Bartlett and S. R. Kulkarni. The complexity of model classes, and smoothing noisy data (invited). In Proceedings of the 35th IEEE Conference on Decision and Control, pages TM09-4, 2312-2317. IEEE, 1996.
[24]  A. Kowalczyk, J. Szymanski, P. L. Bartlett, and R. C. Williamson. Examples of learning curves from a modified VC-formalism. In Advances in Neural Information Processing Systems 8, pages 344-350, 1996.
[25]  P. L. Bartlett and R. C. Williamson. The Vapnik-Chervonenkis dimension and pseudodimension of two-layer neural networks with discrete inputs. Neural Computation, 8:653-656, 1996.
[26]  P. L. Bartlett, P. M. Long, and R. C. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52(3):434-452, 1996. (special issue on COLT '94).
[27]  M. Anthony, P. L. Bartlett, Y. Ishai, and J. Shawe-Taylor. Valid generalisation from approximate interpolation. Combinatorics, Probability, and Computing, 5:191-214, 1996.
[28]  W. S. Lee, P. L. Bartlett, and R. C. Williamson. Efficient agnostic learning of neural networks with bounded fan-in. IEEE Transactions on Information Theory, 42(6):2118-2132, 1996.
[29]  Peter L. Bartlett, Anthony Burkitt, and Robert C. Williamson, editors. Proceedings of the Seventh Australian Conference on Neural Networks. Australian National University, 1996.
[30]  G. Loy and P. L. Bartlett. Generalization and the size of the weights: an experimental study. In Proceedings of the Eighth Australian Conference on Neural Networks, pages 60-64, 1997.
[31]  P. L. Bartlett. Neural network learning. (abstract of invited talk.). In CONTROL 97 Conference Proceedings, Institution of Engineers Australia, page 543, 1997.
[32]  J. Baxter and P. L. Bartlett. A result relating convex n-widths to covering numbers with some applications to neural networks. In S. Ben-David, editor, Proceedings of the Third European Conference on Computational Learning Theory (EuroCOLT'97), pages 251-259. Springer, 1997.
[33]  P. L. Bartlett, T. Linder, and G. Lugosi. A minimax lower bound for empirical quantizer design. In S. Ben-David, editor, Proceedings of the Third European Conference on Computational Learning Theory (EuroCOLT'97), pages 220-222. Springer, 1997.
[34]  P. L. Bartlett, T. Linder, and G. Lugosi. The minimax distortion redundancy in empirical quantizer design (abstract). In Proceedings of the 1997 IEEE International Symposium on Information Theory, page 511, 1997.
[35]  R. E. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In Machine Learning: Proceedings of the Fourteenth International Conference, pages 322-330, 1997.
[36]  P. L. Bartlett. For valid generalization, the size of the weights is more important than the size of the network. In Advances in Neural Information Processing Systems 9, pages 134-140, 1997.
[37]  W. S. Lee, P. L. Bartlett, and R. C. Williamson. Correction to `lower bounds on the VC-dimension of smoothly parametrized function classes'. Neural Computation, 9:765-769, 1997.
[38]  P. L. Bartlett. Book review: `Neural networks for pattern recognition,' Christopher M. Bishop. Statistics in Medicine, 16(20):2385-2386, 1997.
[39]  P. L. Bartlett, S. R. Kulkarni, and S. E. Posner. Covering numbers for real-valued function classes. IEEE Transactions on Information Theory, 43(5):1721-1724, 1997.
[40]  L. Mason, P. L. Bartlett, and M. Golea. Generalization in threshold networks, combined decision trees and combined mask perceptrons. In T. Downs, M. Frean, and M. Gallagher, editors, Proceedings of the Ninth Australian Conference on Neural Networks (ACNN'98), pages 84-88. University of Queensland, 1998.
[41]  B. Schölkopf, P. L. Bartlett, A. Smola, and R. Williamson. Support vector regression with automatic accuracy control. In L. Niklasson, M. Boden, and T. Ziemke, editors, Perspectives in Neural Computing: Proceedings of the 8th International Conference on Artificial Neural Networks (ICANN'98), pages 111-116. Springer-Verlag, 1998.
[42]  L. C. Kammer, R. R. Bitmead, and P. L. Bartlett. Direct iterative tuning via spectral analysis. In Proceedings of the IEEE Conference on Decision and Control, volume 3, pages 2874-2879, 1998.
[43]  M. Golea, P. L. Bartlett, and W. S. Lee. Generalization in decision trees and DNF: Does size matter? In Advances in Neural Information Processing Systems 10, pages 259-265, 1998.
[44]  J. Baxter and P. L. Bartlett. The canonical distortion measure in feature space and 1-NN classification. In Advances in Neural Information Processing Systems 10, pages 245-251, 1998.
[45]  P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525-536, 1998.
[46]  P. L. Bartlett and P. M. Long. Prediction, learning, uniform convergence, and scale-sensitive dimensions. Journal of Computer and System Sciences, 56(2):174-190, 1998. (special issue on COLT '95).
[47]  L. C. Kammer, R. R. Bitmead, and P. L. Bartlett. Optimal controller properties from closed-loop experiments. Automatica, 34(1):83-91, 1998.
[48]  P. L. Bartlett and M. Vidyasagar. Introduction to the special issue on learning theory. Systems and Control Letters, 34:113-114, 1998.
[49]  P. L. Bartlett and S. Kulkarni. The complexity of model classes, and smoothing noisy data. Systems and Control Letters, 34(3):133-140, 1998.
[50]  P. L. Bartlett, T. Linder, and G. Lugosi. The minimax distortion redundancy in empirical quantizer design. IEEE Transactions on Information Theory, 44(5):1802-1813, 1998.
[51]  J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926-1940, 1998.
[52]  W. S. Lee, P. L. Bartlett, and R. C. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974-1980, 1998.
[53]  P. L. Bartlett, V. Maiorov, and R. Meir. Almost linear VC dimension bounds for piecewise polynomial networks. Neural Computation, 10(8):2159-2173, 1998.
[54]  R. E. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5):1651-1686, 1998.
[55]  Peter L. Bartlett and Yishay Mansour, editors. Proceedings of the Eleventh Annual Conference on Computational Learning Theory. ACM Press, 1998.
[56]  P. L. Bartlett and J. Baxter. Voting methods for data segmentation. In Proceedings of the Advanced Investment Technology Conference, pages 35-40. Bond University, 1999.
[57]  L. Mason, P. L. Bartlett, and J. Baxter. Error bounds for voting classifiers using margin cost functions (invited abstract). In Proceedings of the IEEE Information Theory Workshop on Detection, Estimation, Classification and Imaging, page 36, 1999.
[58]  P. L. Bartlett and S. Ben-David. Hardness results for neural network approximation problems. In Proceedings of the Fourth European Conference on Computational Learning Theory, pages 50-62, 1999.
[59]  Y. Guo, P. L. Bartlett, J. Shawe-Taylor, and R. C. Williamson. Covering numbers for support vector machines. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 267-277, 1999.
[60]  T. Koshizen, P. L. Bartlett, and A. Zelinsky. Sensor fusion of odometry and sonar sensors by the Gaussian mixture Bayes' technique in mobile robot position estimation. In Proceedings of the 1999 IEEE International Conference on Systems, Man and Cybernetics, volume 4, pages 742-747, 1999.
[61]  B. Schölkopf, P. L. Bartlett, A. Smola, and R. Williamson. Shrinking the tube: a new support vector regression algorithm. In Advances in Neural Information Processing Systems 11, pages 330-336, 1999.
[62]  L. Mason, P. L. Bartlett, and J. Baxter. Direct optimization of margins improves generalization in combined classifiers. In Advances in Neural Information Processing Systems 11, pages 288-294, 1999.
[63]  P. L. Bartlett, V. Maiorov, and R. Meir. Almost linear VC dimension bounds for piecewise polynomial networks. In Advances in Neural Information Processing Systems 11, pages 190-196, 1999.
[64]  P. L. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 43-54. MIT Press, 1999.
[65]  P. L. Bartlett. Efficient neural network learning. In V. D. Blondel, E. D. Sontag, M. Vidyasagar, and J. C. Willems, editors, Open Problems in Mathematical Systems Theory and Control, pages 35-38. Springer Verlag, 1999.
[66]  P. L. Bartlett and G. Lugosi. An inequality for uniform deviations of sample averages from their means. Statistics and Probability Letters, 44(1):55-62, 1999.
[67]  Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999. 404pp. ISBN 978052157353X. Reprinted 2001, 2002. Paperback edition 2009; ISBN 9780521118620.
[68]  P. L. Bartlett and J. Baxter. Estimation and approximation bounds for gradient-based reinforcement learning. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 133-141, 2000.
[69]  P. L. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 286-297, 2000.
[70]  J. Baxter and P. L. Bartlett. GPOMDP: An online algorithm for estimating performance gradients in POMDP's, with applications. In Proceedings of the 2000 International Conference on Machine Learning, pages 41-48, 2000.
[71]  L. Mason, J. Baxter, P. L. Bartlett, and M. Frean. Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems 12, pages 512-518, 2000.
[72]  J. Baxter and P. L. Bartlett. Direct gradient-based reinforcement learning (invited). In Proceedings of the International Symposium on Circuits and Systems, pages III-271-274, 2000.
[73]  P. L. Bartlett and J. Baxter. Stochastic optimization of controlled partially observable Markov decision processes. In Proceedings of the IEEE Conference on Decision and Control, volume 1, pages 124-129, 2000.
[74]  L. Mason, P. L. Bartlett, and J. Baxter. Improved generalization through explicit optimization of margins. Machine Learning, 38(3):243-255, 2000.
[75]  L. C. Kammer, R. R. Bitmead, and P. L. Bartlett. Direct iterative tuning via spectral analysis. Automatica, 36(9):1301-1307, 2000.
[76]  B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207-1245, 2000.
[77]  S. Parameswaran, M. F. Parkinson, and P. L. Bartlett. Profiling in the ASP codesign environment. Journal of Systems Architecture, 46(14):1263-1274, 2000.
[78]  P. L. Bartlett, S. Ben-David, and S. R. Kulkarni. Learning changing concepts by exploiting the structure of change. Machine Learning, 41(2):153-174, 2000.
[79]  A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans. Introduction to large margin classifiers. In Advances in Large Margin Classifiers, pages 1-29. MIT Press, 2000.
[80]  L. Mason, J. Baxter, P. L. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 221-246. MIT Press, 2000.
[81]  M. Anthony and P. L. Bartlett. Function learning from interpolation. Combinatorics, Probability, and Computing, 9:213-225, 2000.
[82]  Alexander J. Smola, Peter L. Bartlett, Bernhard Schölkopf, and Dale Schuurmans, editors. Advances in Large Margin Classifiers. MIT Press, 2000.
[83]  A. J. Smola and P. L. Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems 13, pages 619-625, 2001.
[84]  P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. In Proceedings of the Fourteenth Annual Conference on Computational Learning Theory and Fifth European Conference on Computational Learning Theory, pages 224-240, 2001.
[85]  A. Ben-Hur, T. Barnes, P. L. Bartlett, O. Chapelle, A. Elisseeff, H. Fristche, I. Guyon, B. Schölkopf, J. Weston, E. Fung, C. Enderwick, E. A. Dalmasso, B.-L. Adam, J. W. Davis, A. Vlahou, L. Cazares, M. Ward, P. F. Schellhammer, J. Semmes, and G. L. Wright. Application of support vector machines to the classification of proteinchip system mass spectral data of prostate cancer serum samples (abstract). In Second Annual National Cancer Institute Early Detection Research Network Scientific Workshop, 2001.
[86]  J. Baxter, P. L. Bartlett, and L. Weaver. Experiments with infinite-horizon, policy-gradient estimation. Journal of Artificial Intelligence Research, 15:351-381, 2001.
[87]  J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319-350, 2001.
[88]  E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. In Advances in Neural Information Processing Systems 14, pages 1507-1514, 2002.
[89]  G. Lanckriet, N. Cristianini, P. L. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semidefinite programming. In Proceedings of the International Conference on Machine Learning, pages 323-330, 2002.
[90]  P. L. Bartlett, O. Bousquet, and S. Mendelson. Localized Rademacher complexity. In Proceedings of the Conference on Computational Learning Theory, pages 44-58, 2002.
[91]  L. Mason, P. L. Bartlett, and M. Golea. Generalization error of combined classifiers. Journal of Computer and System Sciences, 65(2):415-438, 2002.
[92]  P. L. Bartlett, P. Fischer, and K.-U. Höffgen. Exploiting random walks for learning. Information and Computation, 176(2):121-135, 2002.
[93]  P. L. Bartlett and S. Ben-David. Hardness results for neural network approximation problems. Theoretical Computer Science, 284(1):53-66, 2002. (special issue on Eurocolt'99).
[94]  P. L. Bartlett and J. Baxter. Estimation and approximation bounds for gradient-based reinforcement learning. Journal of Computer and System Sciences, 64(1):133-150, 2002.
[95]  P. L. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. Machine Learning, 48:85-113, 2002.
[96]  P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.
[97]  Y. Guo, P. L. Bartlett, J. Shawe-Taylor, and R. C. Williamson. Covering numbers for support vector machines. IEEE Transactions on Information Theory, 48(1):239-250, 2002.
[98]  Peter L. Bartlett. An introduction to reinforcement learning theory: value function methods. In Shahar Mendelson and Alexander J. Smola, editors, Advanced Lectures on Machine Learning, volume 2600, pages 184-202. Springer, 2003.
[99]  Peter L. Bartlett and Wolfgang Maass. Vapnik-Chervonenkis dimension of neural nets. In Michael A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 1188-1192. MIT Press, 2003. Second Edition.
[100]  Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Technical Report 638, Department of Statistics, U.C. Berkeley, 2003.
Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial upper bounds on excess risk under the weakest possible condition on the loss function: that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a simple variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled convex hulls of a finite-dimensional base class, with a variety of commonly used loss functions.

[101]  Peter L. Bartlett. Prediction algorithms: complexity, concentration and convexity. In Proceedings of the 13th IFAC Symposium on System Identification, pages 1507-1517, 2003.
In this paper, we review two families of algorithms used to estimate large-scale statistical models for prediction problems, kernel methods and boosting algorithms. We focus on the computational and statistical properties of prediction algorithms of this kind. Convexity plays an important role for these algorithms, since they exploit the computational advantages of convex optimization procedures. However, in addition to its computational advantages, the use of convexity in these methods also confers some attractive statistical properties. We present some recent results that show the advantages of convexity for estimation rates, the rates at which the prediction accuracies approach their optimal values. In addition, we present results that quantify the cost of using a convex loss function in place of the real loss function of interest.

[102]  Peter L. Bartlett, Shahar Mendelson, and Petra Philips. Local complexities for empirical risk minimization. In Proceedings of the 17th Annual Conference on Computational Learning Theory (COLT2004), volume 3120, pages 270-284. Springer, 2004.
We present sharp bounds on the risk of the empirical minimization algorithm under mild assumptions on the class. We introduce the notion of isomorphic coordinate projections and show that this leads to a sharper error bound than the best previously known. The quantity which governs this bound on the empirical minimizer is the fixed point of the function $\xi_n(r) = \mathbb{E}\sup\{|\mathbb{E}f - \mathbb{E}_n f| : f \in F,\ \mathbb{E}f = r\}$. We prove that this is the best estimate one can obtain using `structural results', and that it is possible to estimate the error rate from data. We then prove that the bound on the empirical minimization algorithm can be improved further by a direct analysis, and that the correct error rate is the maximizer of $\xi'_n(r) - r$, where $\xi'_n(r) = \mathbb{E}\sup\{\mathbb{E}f - \mathbb{E}_n f : f \in F,\ \mathbb{E}f = r\}$.

[103]  Peter L. Bartlett and Ambuj Tewari. Sparseness vs estimating conditional probabilities: Some asymptotic results. In Proceedings of the 17th Annual Conference on Learning Theory, volume 3120, pages 564-578. Springer, 2004.
One of the nice properties of kernel classifiers such as SVMs is that they often produce sparse solutions. However, the decision functions of these classifiers cannot always be used to estimate the conditional probability of the class label. We investigate the relationship between these two properties and show that these are intimately related: sparseness does not occur when the conditional probabilities can be unambiguously estimated. We consider a family of convex loss functions and derive sharp asymptotic bounds for the number of support vectors. This enables us to characterize the exact tradeoff between sparseness and the ability to estimate conditional probabilities for these loss functions.

[104]  Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Large margin classifiers: convex loss, low noise, and convergence rates. In Advances in Neural Information Processing Systems, 16, 2004.
Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function: that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.
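
As a rough illustration of what "a convex surrogate of the 0-1 loss" means, the sketch below writes three standard surrogates as functions of the margin y*f(x) and checks on a grid that each one upper-bounds the 0-1 loss. The function names, the grid, and the base-2 scaling of the logistic loss (chosen so that it equals 1 at margin 0) are assumptions made for this example; none of this code comes from the paper.

```python
import math


def zero_one_loss(margin):
    """0-1 loss on the margin y*f(x); a margin of exactly 0 is counted as an error."""
    return 1.0 if margin <= 0 else 0.0


def hinge_loss(margin):
    """Hinge loss, the surrogate usually associated with support vector machines."""
    return max(0.0, 1.0 - margin)


def exponential_loss(margin):
    """Exponential loss, the surrogate usually associated with AdaBoost."""
    return math.exp(-margin)


def logistic_loss(margin):
    """Logistic loss with a base-2 logarithm so that it equals 1 at margin 0."""
    return math.log2(1.0 + math.exp(-margin))


SURROGATES = {
    "hinge": hinge_loss,
    "exponential": exponential_loss,
    "logistic": logistic_loss,
}

# Illustrative grid of margins from -3.0 to 3.0.
MARGINS = [k / 10 for k in range(-30, 31)]


def dominates_zero_one(loss, grid):
    """Return True if the surrogate is at least the 0-1 loss at every grid point."""
    return all(loss(m) >= zero_one_loss(m) for m in grid)


if __name__ == "__main__":
    for name, loss in SURROGATES.items():
        print(name, dominates_zero_one(loss, MARGINS))
```

Pointwise domination only shows that a small surrogate risk forces a small 0-1 risk; the quantitative link between the two excess risks is what the abstract's variational transformation provides.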

[105]  Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Discussion of boosting papers. The Annals of Statistics, 32(1):85-91, 2004.
[106]  E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5:1471-1530, 2004.
[107]  G. Lanckriet, N. Cristianini, P. L. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27-72, 2004.
[108]  Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497-1537, 2005.
We propose new bounds on the error of learning algorithms in terms of a data-dependent notion of complexity. The estimates we establish give optimal rates and are based on a local and empirical version of Rademacher averages, in the sense that the Rademacher averages are computed from the data, on a subset of functions with small empirical error. We present some applications to prediction with bounded loss, and to regression with a convex loss function and a convex function class.
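
To make "Rademacher averages computed from the data, on a subset of functions with small empirical error" more concrete, here is a minimal Monte Carlo sketch for a finite class. It assumes each function is represented by its list of loss values on the sample, and it uses "empirical mean at most `radius`" as a simplified stand-in for the localization condition. The names `empirical_rademacher`, `local_subset`, and `LOSSES` are hypothetical, and the paper's actual definitions are more general than this.

```python
import random


def empirical_rademacher(function_values, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma max_f (1/n) * sum_i sigma_i * f(x_i).

    function_values holds one row per function; row j lists f_j(x_1), ..., f_j(x_n).
    """
    rng = random.Random(seed)
    n = len(function_values[0])
    total = 0.0
    for _ in range(num_draws):
        # Independent random signs, one per sample point.
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(
            sum(s * v for s, v in zip(sigma, row)) / n for row in function_values
        )
    return total / num_draws


def local_subset(function_values, radius):
    """Keep only the functions whose empirical mean is at most radius."""
    return [row for row in function_values if sum(row) / len(row) <= radius]


# Toy loss values of four functions on eight sample points (made up for illustration).
LOSSES = [
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0],
]

if __name__ == "__main__":
    print("full class: ", empirical_rademacher(LOSSES))
    print("local class:", empirical_rademacher(local_subset(LOSSES, 0.25)))
```

Because the local class is a subset of the full class, its estimate can never exceed the full one for the same sign draws, which is the intuition behind using local averages to obtain sharper bounds.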

[109]  Rafael Jiménez-Rodriguez, Nicholas Sitar, and Peter L. Bartlett. Maximum likelihood estimation of trace length distribution parameters using the EM algorithm. In G. Barla and M. Barla, editors, Prediction, Analysis and Design in Geomechanical Applications: Proceedings of the Eleventh International Conference on Computer Methods and Advances in Geomechanics (IACMAG2005), volume 1, pages 619-626, Bologna, 2005. Pàtron Editore.
[110]  Peter L. Bartlett, Michael Collins, Ben Taskar, and David McAllester. Exponentiated gradient algorithms for large-margin structured classification. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 113-120, Cambridge, MA, 2005. MIT Press.
We consider the problem of structured classification, where the task is to predict a label y from an input x, and y has meaningful internal structure. Our framework includes supervised training of both Markov random fields and weighted context-free grammars as special cases. We describe an algorithm that solves the large-margin optimization problem defined by Taskar et al., using an exponential-family (Gibbs distribution) representation of structured objects. The algorithm is efficient, even in cases where the number of labels y is exponential in size, provided that certain expectations under Gibbs distributions can be calculated efficiently. The optimization method we use for structured labels relies on a more general result, specifically the application of exponentiated gradient (EG) updates to quadratic programs (QPs). We describe a new method for solving QPs based on these techniques, and give bounds on its rate of convergence. In addition to their application to the structured-labels task, the EG updates lead to simple algorithms for optimizing “conventional” binary or multiclass SVM problems. Finally, we give a new generalization bound for structured classification, using PAC-Bayesian methods for the analysis of large margin classifiers.

[111]  Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. In Proceedings of the 18th Annual Conference on Learning Theory, volume 3559, pages 143-157. Springer, 2005.
[112]  Peter L. Bartlett and Shahar Mendelson. Discussion of “2004 IMS Medallion Lecture: Local Rademacher complexities and oracle inequalities in risk minimization” by V. Koltchinskii. The Annals of Statistics, 34(6):2657-2663, 2006.
[113]  Peter L. Bartlett and Mikhail Traskin. Adaboost and other large margin classifiers: Convexity in pattern classification. In Proceedings of the 5th Workshop on Defence Applications of Signal Processing, 2006.
[114]  Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Comment. Statistical Science, 21(3):341-346, 2006.
[115]  Peter L. Bartlett and Mikhail Traskin. Adaboost is consistent. Technical report, U. C. Berkeley, 2006.
[116]  Peter L. Bartlett and Marten H. Wegkamp. Classification with a reject option using a hinge loss. Technical report, U.C. Berkeley, 2006.
We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function f, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss (the f-risk) also minimizes the risk. We also study the rate at which the f-risk approaches its minimum value. We show that fast rates are possible when the conditional probability Pr(Y=1|X) is unlikely to be close to certain critical values.

[117]  Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311-334, 2006.
We investigate the behavior of the empirical minimization algorithm using various methods. We first analyze it by comparing the empirical, random, structure and the original one on the class, either in an additive sense, via the uniform law of large numbers, or in a multiplicative sense, using isomorphic coordinate projections. We then show that a direct analysis of the empirical minimization algorithm yields a significantly better bound, and that the estimates we obtain are essentially sharp. The method of proof we use is based on Talagrand's concentration inequality for empirical processes.

[118] 
Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe.
Convexity, classification, and risk bounds.
Journal of the American Statistical Association,
101(473):138-156, 2006.
(Was Department of Statistics, U.C. Berkeley Technical Report number
638, 2003).
[ bib 
.ps.gz 
.pdf ]
Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial upper bounds on excess risk under the weakest possible condition on the loss function: that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a simple variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled convex hulls of a finite-dimensional base class, with a variety of commonly used loss functions.

[119] 
Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, and Peter L.
Bartlett.
Exponentiated gradient algorithms for conditional random fields and
max-margin Markov networks.
Technical report, U.C. Berkeley, 2007.
[ bib 
.pdf ]
Log-linear and maximum-margin models are two commonly used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or max-margin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O(1/ε) EG updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log(1/ε)) updates are required. For both the max-margin and log-linear cases, our bounds suggest that the online algorithm requires a factor of n less computation to reach a desired accuracy, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to multiclass problems as well as a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods.
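
As a rough illustration of the kind of update the abstract refers to, the following minimal sketch applies one generic exponentiated gradient step to a weight vector constrained to the probability simplex. It is not the authors' structured-prediction algorithm: the function name `eg_update`, the step size `eta`, and the toy gradient are all assumptions made for this example.

```python
import math


def eg_update(weights, gradient, eta):
    """One generic exponentiated gradient step on the probability simplex.

    Each weight is multiplied by exp(-eta * gradient_i) and the result is
    renormalized, so the output stays non-negative and sums to one.
    (Hypothetical helper for illustration only.)
    """
    scaled = [w * math.exp(-eta * g) for w, g in zip(weights, gradient)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Toy example: uniform starting point and an arbitrary gradient vector.
w = [0.25, 0.25, 0.25, 0.25]
grad = [1.0, 0.0, 0.5, 2.0]
w_next = eg_update(w, grad, eta=0.5)
```

The multiplicative form keeps the iterate inside the simplex without an explicit projection step, which is why it suits objectives with simplex constraints such as the duals described in the abstract.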

[120]  Jacob Duncan Abernethy, Peter L. Bartlett, and Alexander Rakhlin. Multitask learning with expert advice. Technical Report UCB/EECS-2007-20, EECS Department, University of California, Berkeley, 2007. [ bib  .html ] 
[121] 
Jacob Abernethy, Peter L. Bartlett, Alexander Rakhlin, and Ambuj Tewari.
Minimax lower bounds for online convex games.
Technical report, UC Berkeley, 2007.
[ bib 
.pdf ]
A number of learning problems can be cast as an Online Convex Game: on each round, a learner makes a prediction x from a convex set, the environment plays a loss function f, and the learner's long-term goal is to minimize regret. Algorithms with provably low regret have been proposed by Zinkevich, when f is assumed to be convex, and by Hazan et al., when f is assumed to be strongly convex. We consider these two settings and analyze such games from a minimax perspective, proving lower bounds in each case. These results prove that the existing algorithms are essentially optimal.

[122] 
Benjamin I. P. Rubinstein, Peter L. Bartlett, and J. Hyam Rubinstein.
Shifting: one-inclusion mistake bounds and sample compression.
Technical report, EECS Department, University of California,
Berkeley, 2007.
[ bib 
.pdf ]
We present new expected risk bounds for binary and multiclass prediction, and resolve several recent conjectures on sample compressibility due to Kuzmin and Warmuth. By exploiting the combinatorial structure of concept class F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy. The key step in their proof is a d=VC(F) bound on the graph density of a subgraph of the hypercube, the one-inclusion graph. The first main result of this report is a density bound of n choose(n-1,<=d-1)/choose(n,<=d) < d, which positively resolves a conjecture of Kuzmin and Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved one-inclusion mistake bound. The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is an algebraic topological property of maximum classes of VC-dimension d as being d-contractible simplicial complexes, extending the well-known characterization that d=1 maximum classes are trees. We negatively resolve a minimum degree conjecture of Kuzmin and Warmuth (the second part to a conjectured proof of correctness for Peeling) that every class has one-inclusion minimum degree at most its VC-dimension. Our final main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This result improves on known PAC-based expected risk bounds by a factor of O(log n) and is shown to be optimal up to an O(log k) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout.

[123] 
Peter L. Bartlett and Mikhail Traskin.
Adaboost is consistent.
Journal of Machine Learning Research, 8:2347-2368, 2007.
[ bib 
.pdf 
.pdf ]
The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n^(1-a) iterations (for sample size n and 0<a<1), the sequence of risks of the classifiers it produces approaches the Bayes risk.
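
The stopping rule in this abstract can be made concrete with a small sketch. The abstract only states that the number of iterations is n^(1-a) with 0<a<1; rounding down and enforcing at least one iteration are assumptions made here so that the function returns a usable integer, and the name `adaboost_stopping_time` is hypothetical.

```python
import math


def adaboost_stopping_time(n, a):
    """Number of boosting rounds suggested by the n^(1-a) rule.

    n: sample size (positive integer).
    a: exponent parameter with 0 < a < 1, as in the abstract.
    Rounding down and the floor of one round are illustrative assumptions.
    """
    if n < 1:
        raise ValueError("sample size n must be at least 1")
    if not 0 < a < 1:
        raise ValueError("a must satisfy 0 < a < 1")
    return max(1, math.floor(n ** (1 - a)))


# The number of rounds grows with n, but more slowly than n itself.
rounds_small = adaboost_stopping_time(1_000, 0.5)
rounds_large = adaboost_stopping_time(1_000_000, 0.5)
```

Under this rule the number of rounds tends to infinity with the sample size while the ratio of rounds to samples tends to zero.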

[124] 
Alexander Rakhlin, Jacob Abernethy, and Peter L. Bartlett.
Online discovery of similarity mappings.
In Proceedings of the 24th International Conference on Machine
Learning (ICML-2007), pages 767-774, 2007.
[ bib ]
We consider the problem of choosing, sequentially, a map which assigns elements of a set A to a few elements of a set B. On each round, the algorithm suffers some cost associated with the chosen assignment, and the goal is to minimize the cumulative loss of these choices relative to the best map on the entire sequence. Even though the offline problem of finding the best map is provably hard, we show that there is an equivalent online approximation algorithm, Randomized Map Prediction (RMP), that is efficient and performs nearly as well. While drawing upon results from the `Online Prediction with Expert Advice' setting, we show how RMP can be utilized as an online approach to several standard batch problems. We apply RMP to online clustering as well as online feature selection and, surprisingly, RMP often outperforms the standard batch algorithms on these problems.

[125] 
Jacob Abernethy, Peter L. Bartlett, and Alexander Rakhlin.
Multitask learning with expert advice.
In Proceedings of the Conference on Learning Theory, pages
484-498, 2007.
[ bib ]
We consider the problem of prediction with expert advice in the setting where a forecaster is presented with several online prediction tasks. Instead of competing against the best expert separately on each task, we assume the tasks are related, and thus we expect that a few experts will perform well on the entire set of tasks. That is, our forecaster would like, on each task, to compete against the best expert chosen from a small set of experts. While we describe the `ideal' algorithm and its performance bound, we show that the computation required for this algorithm is as hard as computation of a matrix permanent. We present an efficient algorithm based on mixing priors, and prove a bound that is nearly as good for the sequential task presentation case. We also consider a harder case where the task may change arbitrarily from round to round, and we develop an efficient randomized algorithm based on Markov chain Monte Carlo techniques.

[126]  Ambuj Tewari and Peter L. Bartlett. Bounded parameter Markov decision processes with average reward criterion. In Proceedings of the Conference on Learning Theory, pages 263-277, 2007. [ bib ] 
[127] 
Peter L. Bartlett and Mikhail Traskin.
Adaboost is consistent.
In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances
in Neural Information Processing Systems 19, pages 105-112, Cambridge, MA,
2007. MIT Press.
[ bib 
.pdf ]
The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n^(1-a) iterations (for sample size n and a>0), the sequence of risks of the classifiers it produces approaches the Bayes risk.

[128] 
David Rosenberg and Peter L. Bartlett.
The Rademacher complexity of co-regularized kernel classes.
In Marina Meila and Xiaotong Shen, editors, Proceedings of the
Eleventh International Conference on Artificial Intelligence and Statistics,
volume 2, pages 396-403, 2007.
[ bib 
.pdf ]
In the multi-view approach to semi-supervised learning, we choose one predictor from each of multiple hypothesis classes, and we `co-regularize' our choices by penalizing disagreement among the predictors on the unlabeled data. In this paper we examine the co-regularization method used in the recently proposed co-regularized least squares (CoRLS) algorithm. In this method we have two hypothesis classes, each a reproducing kernel Hilbert space (RKHS), and we co-regularize by penalizing the average squared difference in predictions on the unlabeled data. We get our final predictor by taking the pointwise average of the predictors from each view. We call the set of predictors that can result from this procedure the co-regularized hypothesis class. The main result of this paper is a tight bound on the Rademacher complexity of the co-regularized hypothesis class in terms of the kernel matrices of each RKHS. We find that the co-regularization reduces the Rademacher complexity of the hypothesis class by an amount depending on how different the two views are, measured by a data-dependent metric. We then use standard techniques to bound the gap between training error and test error for the CoRLS algorithm. Experimentally, we find that the amount of reduction in complexity introduced by co-regularization correlates with the amount of improvement that co-regularization gives in the CoRLS algorithm.

[129]  Benjamin I. P. Rubinstein, Peter L. Bartlett, and J. Hyam Rubinstein. Shifting, one-inclusion mistake bounds and tight multiclass expected risk bounds. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 1193-1200, Cambridge, MA, 2007. MIT Press. [ bib  .pdf ] 
[130]  Peter L. Bartlett and Ambuj Tewari. Sample complexity of policy search with known dynamics. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 97-104, Cambridge, MA, 2007. MIT Press. [ bib  .pdf ] 
[131] 
Peter L. Bartlett.
Fast rates for estimation error and oracle inequalities for model
selection.
Technical Report 729, Department of Statistics, U.C. Berkeley, 2007.
[ bib 
.pdf ]
We consider complexity penalization methods for model selection. These methods aim to choose a model to optimally trade off estimation and approximation errors by minimizing the sum of an empirical risk term and a complexity penalty. It is well known that if we use a bound on the maximal deviation between empirical and true risks as a complexity penalty, then the risk of our choice is no more than the approximation error plus twice the complexity penalty. There are many cases, however, where complexity penalties like this give loose upper bounds on the estimation error. In particular, if we choose a function from a suitably simple convex function class with a strictly convex loss function, then the estimation error (the difference between the risk of the empirical risk minimizer and the minimal risk in the class) approaches zero at a faster rate than the maximal deviation between empirical and true risks. In this note, we address the question of whether it is possible to design a complexity penalized model selection method for these situations. We show that, provided the sequence of models is ordered by inclusion, in these cases we can use tight upper bounds on estimation error as a complexity penalty. Surprisingly, this is the case even in situations when the difference between the empirical risk and true risk (and indeed the error of any estimate of the approximation error) decreases much more slowly than the complexity penalty. We give an oracle inequality showing that the resulting model selection method chooses a function with risk no more than the approximation error plus a constant times the complexity penalty.

[132] 
Peter L. Bartlett, Shahar Mendelson, and Petra Philips.
Optimal sample-based estimates of the expectation of the empirical
minimizer.
Technical report, U.C. Berkeley, 2007.
[ bib 
.ps.gz 
.pdf ]
We study sample-based estimates of the expectation of the function produced by the empirical minimization algorithm. We investigate the extent to which one can estimate the rate of convergence of the empirical minimizer in a data-dependent manner. We establish three main results. First, we provide an algorithm that upper bounds the expectation of the empirical minimizer in a completely data-dependent manner. This bound is based on a structural result in http://www.stat.berkeley.edu/~bartlett/papers/bmem03.pdf, which relates expectations to sample averages. Second, we show that these structural upper bounds can be loose. In particular, we demonstrate a class for which the expectation of the empirical minimizer decreases as O(1/n) for sample size n, although the upper bound based on structural properties is Ω(1). Third, we show that this looseness of the bound is inevitable: we present an example that shows that a sharp bound cannot be universally recovered from empirical data.

[133]  Peter L. Bartlett and Ambuj Tewari. Sparseness vs estimating conditional probabilities: Some asymptotic results. Journal of Machine Learning Research, 8:775-790, April 2007. [ bib  .html ] 
[134]  Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007-1025, May 2007. (Invited paper). [ bib  .html ] 
[135] 
Alekh Agarwal, Alexander Rakhlin, and Peter Bartlett.
Matrix regularization techniques for online multitask learning.
Technical Report UCB/EECS-2008-138, EECS Department, University of
California, Berkeley, 2008.
[ bib 
.pdf ]
In this paper we examine the problem of prediction with expert advice in a setup where the learner is presented with a sequence of examples coming from different tasks. In order for the learner to be able to benefit from performing multiple tasks simultaneously, we make assumptions of task relatedness by constraining the comparator to use a lesser number of best experts than the number of tasks. We show how this corresponds naturally to learning under spectral or structural matrix constraints, and propose regularization techniques to enforce the constraints. The regularization techniques proposed here are interesting in their own right and multitask learning is just one application for the ideas. A theoretical analysis of one such regularizer is performed, and a regret bound that shows benefits of this setup is reported.

[136] 
Peter L. Bartlett.
Fast rates for estimation error and oracle inequalities for model
selection.
Econometric Theory, 24(2):545-552, April 2008.
(Was Department of Statistics, U.C. Berkeley Technical Report number
729, 2007).
[ bib 
DOI 
.pdf ]
We consider complexity penalization methods for model selection. These methods aim to choose a model to optimally trade off estimation and approximation errors by minimizing the sum of an empirical risk term and a complexity penalty. It is well known that if we use a bound on the maximal deviation between empirical and true risks as a complexity penalty, then the risk of our choice is no more than the approximation error plus twice the complexity penalty. There are many cases, however, where complexity penalties like this give loose upper bounds on the estimation error. In particular, if we choose a function from a suitably simple convex function class with a strictly convex loss function, then the estimation error (the difference between the risk of the empirical risk minimizer and the minimal risk in the class) approaches zero at a faster rate than the maximal deviation between empirical and true risks. In this note, we address the question of whether it is possible to design a complexity penalized model selection method for these situations. We show that, provided the sequence of models is ordered by inclusion, in these cases we can use tight upper bounds on estimation error as a complexity penalty. Surprisingly, this is the case even in situations when the difference between the empirical risk and true risk (and indeed the error of any estimate of the approximation error) decreases much more slowly than the complexity penalty. We give an oracle inequality showing that the resulting model selection method chooses a function with risk no more than the approximation error plus a constant times the complexity penalty.

[137] 
Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, and Peter L.
Bartlett.
Exponentiated gradient algorithms for conditional random fields and
max-margin Markov networks.
Journal of Machine Learning Research, 9:1775-1822, August
2008.
[ bib 
.pdf ]
Log-linear and maximum-margin models are two commonly used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or max-margin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O(1/ε) EG updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log(1/ε)) updates are required. For both the max-margin and log-linear cases, our bounds suggest that the online algorithm requires a factor of n less computation to reach a desired accuracy, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to multiclass problems as well as a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods.

[138] 
Peter L. Bartlett and Marten H. Wegkamp.
Classification with a reject option using a hinge loss.
Journal of Machine Learning Research, 9:1823-1840, August
2008.
[ bib 
.pdf ]
We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function f, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss (the f-risk) also minimizes the risk. We also study the rate at which the f-risk approaches its minimum value. We show that fast rates are possible when the conditional probability Pr(Y=1|X) is unlikely to be close to certain critical values.

[139]  Massieh Najafi, David M. Auslander, Peter L. Bartlett, and Philip Haves. Overcoming the complexity of diagnostic problems due to sensor network architecture. In K. Grigoriadis, editor, Proceedings of Intelligent Systems and Control (ISC 2008), pages 633-071, September 2008. [ bib ] 
[140]  Massieh Najafi, David M. Auslander, Peter L. Bartlett, and Philip Haves. Fault diagnostics and supervised testing: How fault diagnostic tools can be proactive? In K. Grigoriadis, editor, Proceedings of Intelligent Systems and Control (ISC 2008), pages 633-034, September 2008. [ bib ] 
[141]  Wee Sun Lee, Peter L. Bartlett, and Robert C. Williamson. Correction to the importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 54(9):4395, September 2008. [ bib  .pdf ] 
[142] 
Ambuj Tewari and Peter L. Bartlett.
Optimistic linear programming gives logarithmic regret for
irreducible MDPs.
In John Platt, Daphne Koller, Yoram Singer, and Sam Roweis, editors,
Advances in Neural Information Processing Systems 20, pages 1505-1512,
Cambridge, MA, September 2008. MIT Press.
[ bib 
.pdf ]
We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates: a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P) log T of the reward obtained by the optimal policy, where C(P) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities and the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP.

[143] 
Peter L. Bartlett, Elad Hazan, and Alexander Rakhlin.
Adaptive online gradient descent.
In John Platt, Daphne Koller, Yoram Singer, and Sam Roweis, editors,
Advances in Neural Information Processing Systems 20, pages 65-72,
Cambridge, MA, September 2008. MIT Press.
[ bib 
.pdf ]
We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al. eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al. for strongly convex functions, achieving intermediate rates between sqrt(T) and log T. Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms.
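
For orientation, the two regimes that this abstract interpolates between can be sketched with plain projected online gradient descent under two standard step-size schedules: proportional to 1/sqrt(t) for general convex losses and proportional to 1/t for strongly convex losses. This is not the Adaptive Online Gradient Descent algorithm itself; the function name, the interval [-1, 1], the squared losses, and the constants in the schedules are assumptions chosen for this toy example.

```python
import math


def online_gradient_descent(grads, x0, step, lo=-1.0, hi=1.0):
    """Projected online gradient descent on the interval [lo, hi].

    grads: list of functions; grads[t-1](x) is the gradient of the
           round-t loss at x (revealed only after x is played).
    step:  function mapping the round index t (starting at 1) to a step size.
    Returns the list of points played, one per round.
    """
    x = x0
    plays = []
    for t, grad in enumerate(grads, start=1):
        plays.append(x)
        x = x - step(t) * grad(x)
        x = min(hi, max(lo, x))  # project back onto the interval
    return plays


# Toy losses f_t(x) = (x - z_t)^2, whose second derivative is 2.
targets = [0.5, -0.2, 0.3, 0.1, 0.4, 0.0]
grads = [lambda x, z=z: 2.0 * (x - z) for z in targets]

# Step size ~ 1/sqrt(t): the schedule used for general convex losses.
sqrt_plays = online_gradient_descent(grads, 0.0, lambda t: 1.0 / math.sqrt(t))

# Step size 1/(H t) with H = 2: the schedule for strongly convex losses.
log_plays = online_gradient_descent(grads, 0.0, lambda t: 1.0 / (2.0 * t))
```

With these particular squared losses, the 1/(2t) schedule makes each play equal to the running mean of the targets seen so far, which gives a quick sanity check on the update.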

[144]  Marco Barreno, Peter L. Bartlett, F. J. Chi, Anthony D. Joseph, Blaine Nelson, Benjamin I. P. Rubinstein, U. Saini, and J. Doug Tygar. Open problems in the security of learning. In Proceedings of the 1st ACM Workshop on AISec (AISec-2008), pages 19-26, October 2008. [ bib  DOI ] 
[145]  Massieh Najafi, David M. Auslander, Peter L. Bartlett, and Philip Haves. Application of machine learning in fault diagnostics of mechanical systems. In Proceedings of the World Congress on Engineering and Computer Science 2008: International Conference on Modeling, Simulation and Control 2008, pages 957-962, October 2008. [ bib  .pdf ] 
[146]  Jacob Abernethy, Peter L. Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008), pages 415-423, December 2008. [ bib  .pdf ] 
[147]  Peter L. Bartlett, Varsha Dani, Thomas Hayes, Sham Kakade, Alexander Rakhlin, and Ambuj Tewari. High-probability regret bounds for bandit online linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008), pages 335-342, December 2008. [ bib  .pdf ] 
[148] 
Benjamin I. P. Rubinstein, Peter L. Bartlett, Ling Huang, and Nina Taft.
Learning in a large function space: Privacy preserving mechanisms for
SVM learning.
Technical Report 0911.5708, arxiv.org, 2009.
[ bib 
http ]
Several recent studies in privacy-preserving learning have considered the tradeoff between utility or risk and the level of differential privacy guaranteed by mechanisms for statistical query processing. In this paper we study this tradeoff in private Support Vector Machine (SVM) learning. We present two efficient mechanisms, one for the case of finite-dimensional feature mappings and one for potentially infinite-dimensional feature mappings with translation-invariant kernels. For the case of translation-invariant kernels, the proposed mechanism minimizes regularized empirical risk in a random Reproducing Kernel Hilbert Space whose kernel uniformly approximates the desired kernel with high probability. This technique, borrowed from large-scale learning, allows the mechanism to respond with a finite encoding of the classifier, even when the function class is of infinite VC dimension. Differential privacy is established using a proof technique from algorithmic stability. Utility (the mechanism's response function is pointwise epsilon-close to non-private SVM with probability 1-delta) is proven by appealing to the smoothness of regularized empirical risk minimization with respect to small perturbations to the feature mapping. We conclude with a lower bound on the optimal differential privacy of the SVM. This negative result states that for any delta, no mechanism can be simultaneously (epsilon,delta)-useful and beta-differentially private for small epsilon and small beta.

[149] 
A. Barth, Benjamin I. P. Rubinstein, M. Sundararajan, J. C. Mitchell, Dawn
Song, and Peter L. Bartlett.
A learning-based approach to reactive security.
Technical Report 0912.1155, arxiv.org, 2009.
[ bib 
http ]
Despite the conventional wisdom that proactive security is superior to reactive security, we show that reactive security can be competitive with proactive security as long as the reactive defender learns from past attacks instead of myopically overreacting to the last attack. Our game-theoretic model follows common practice in the security literature by making worst-case assumptions about the attacker: we grant the attacker complete knowledge of the defender's strategy and do not require the attacker to act rationally. In this model, we bound the competitive ratio between a reactive defense algorithm (which is inspired by online learning theory) and the best fixed proactive defense. Additionally, we show that, unlike proactive defenses, this reactive strategy is robust to a lack of information about the attacker's incentives and knowledge.

[150] 
Jacob Abernethy, Alekh Agarwal, Peter L. Bartlett, and Alexander Rakhlin.
A stochastic view of optimal regret through minimax duality.
Technical Report 0903.5328, arxiv.org, 2009.
[ bib 
http ]
We study the regret of optimal strategies for online convex optimization games. Using von Neumann's minimax theorem, we show that the optimal regret in this adversarial setting is closely related to the behavior of the empirical minimization algorithm in a stochastic process setting: it is equal to the maximum, over joint distributions of the adversary's action sequence, of the difference between a sum of minimal expected losses and the minimal empirical loss. We show that the optimal regret has a natural geometric interpretation, since it can be viewed as the gap in Jensen's inequality for a concave functional (the minimizer over the player's actions of expected loss) defined on a set of probability distributions. We use this expression to obtain upper and lower bounds on the regret of an optimal strategy for a variety of online learning problems. Our method provides upper bounds without the need to construct a learning algorithm; the lower bounds provide explicit optimal strategies for the adversary.

[151]  Benjamin I. P. Rubinstein, Peter L. Bartlett, and J. Hyam Rubinstein. Shifting: one-inclusion mistake bounds and sample compression. Journal of Computer and System Sciences, 75(1):37-59, January 2009. (Was University of California, Berkeley, EECS Department Technical Report EECS-2007-86). [ bib  .pdf ] 
[152] 
Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, and Martin Wainwright.
Information-theoretic lower bounds on the oracle complexity of convex
optimization.
In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and
A. Culotta, editors, Advances in Neural Information Processing Systems
22, pages 1-9, June 2009.
[ bib 
.pdf ]
Despite the large amount of literature on upper bounds on complexity of convex analysis, surprisingly little is known about the fundamental hardness of these problems. The extensive use of convex optimization in machine learning and statistics makes such an understanding very critical to understand fundamental computational limits of learning and estimation. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for some function classes. We also discuss implications of these results to learning and estimation.

[153] 
Jacob Abernethy, Alekh Agarwal, Peter L. Bartlett, and Alexander Rakhlin.
A stochastic view of optimal regret through minimax duality.
In Proceedings of the 22nd Annual Conference on Learning Theory
(COLT 2009), pages 257-266, June 2009.
[ bib 
.pdf ]
We study the regret of optimal strategies for online convex optimization games. Using von Neumann's minimax theorem, we show that the optimal regret in this adversarial setting is closely related to the behavior of the empirical minimization algorithm in a stochastic process setting: it is equal to the maximum, over joint distributions of the adversary's action sequence, of the difference between a sum of minimal expected losses and the minimal empirical loss. We show that the optimal regret has a natural geometric interpretation, since it can be viewed as the gap in Jensen's inequality for a concave functional (the minimizer over the player's actions of expected loss) defined on a set of probability distributions. We use this expression to obtain upper and lower bounds on the regret of an optimal strategy for a variety of online learning problems. Our method provides upper bounds without the need to construct a learning algorithm; the lower bounds provide explicit optimal strategies for the adversary.

[154]  Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI2009), pages 35-42, June 2009. [ bib  .pdf ] 
[155]  David S. Rosenberg, Vikas Sindhwani, Peter L. Bartlett, and Partha Niyogi. Multi-view point cloud kernels for semi-supervised learning. IEEE Signal Processing Magazine, 26(5):145-150, September 2009. [ bib  DOI ] 
[156] 
Peter L. Bartlett, Shahar Mendelson, and Petra Philips.
On the optimality of sample-based estimates of the expectation of the
empirical minimizer.
ESAIM: Probability and Statistics, 14:315-337, January 2010.
[ bib 
.pdf ]
We study sample-based estimates of the expectation of the function produced by the empirical minimization algorithm. We investigate the extent to which one can estimate the rate of convergence of the empirical minimizer in a data-dependent manner. We establish three main results. First, we provide an algorithm that upper bounds the expectation of the empirical minimizer in a completely data-dependent manner. This bound is based on a structural result in http://www.stat.berkeley.edu/~bartlett/papers/bmem03.pdf, which relates expectations to sample averages. Second, we show that these structural upper bounds can be loose. In particular, we demonstrate a class for which the expectation of the empirical minimizer decreases as O(1/n) for sample size n, although the upper bound based on structural properties is Ω(1). Third, we show that this looseness of the bound is inevitable: we present an example that shows that a sharp bound cannot be universally recovered from empirical data.

[157]  A. Barth, Benjamin I. P. Rubinstein, M. Sundararajan, J. C. Mitchell, Dawn Song, and Peter L. Bartlett. A learning-based approach to reactive security. In Proceedings of Financial Cryptography and Data Security (FC10), pages 192-206, 2010. [ bib  DOI ] 
[158] 
Jacob Abernethy, Peter L. Bartlett, and Elad Hazan.
Blackwell approachability and no-regret learning are equivalent.
Technical Report 1011.1936, arxiv.org, 2010.
[ bib 
http ]
We consider the celebrated Blackwell Approachability Theorem for two-player games with vector payoffs. We show that Blackwell's result is equivalent, via efficient reductions, to the existence of 'no-regret' algorithms for Online Linear Optimization. Indeed, we show that any algorithm for one such problem can be efficiently converted into an algorithm for the other. We provide a useful application of this reduction: the first efficient algorithm for calibrated forecasting.

[159] 
Marius Kloft, Ulrich Rückert, and Peter L. Bartlett.
A unifying view of multiple kernel learning.
Technical Report 1005.0437, arxiv.org, 2010.
[ bib 
http ]
Recent research on multiple kernel learning has led to a number of approaches for combining kernels in regularized risk minimization. The proposed approaches include different formulations of objectives and varying regularization strategies. In this paper we present a unifying general optimization criterion for multiple kernel learning and show how existing formulations are subsumed as special cases. We also derive the criterion's dual representation, which is suitable for general smooth optimization algorithms. Finally, we evaluate multiple kernel learning in this framework analytically using a Rademacher complexity bound on the generalization error and empirically in a set of experiments.

[160]  Benjamin I. P. Rubinstein, Peter L. Bartlett, and J. Hyam Rubinstein. Corrigendum to `shifting: One-inclusion mistake bounds and sample compression' [J. Comput. System Sci 75 (1) (2009) 37-59]. Journal of Computer and System Sciences, 76(3-4):278-280, May 2010. [ bib  DOI ] 
[161]  Peter L. Bartlett. Learning to act in uncertain environments. Communications of the ACM, 53(5):98, May 2010. (Invited one-page comment). [ bib  DOI ] 
[162] 
Alekh Agarwal, Peter L. Bartlett, and Max Dama.
Optimal allocation strategies for the dark pool problem.
In Y. W. Teh and M. Titterington, editors, Proceedings of The
Thirteenth International Conference on Artificial Intelligence and Statistics
(AISTATS), volume 9, pages 9-16, May 2010.
[ bib 
.pdf ]
We study the problem of allocating stocks to dark pools. We propose and analyze an optimal approach for allocations, if continuous-valued allocations are allowed. We also propose a modification for the case when only integer-valued allocations are possible. We extend the previous work on this problem to adversarial scenarios, while also improving on those results in the iid setup. The resulting algorithms are efficient, and perform well in simulations under stochastic and adversarial inputs.

[163]  Brian Kulis and Peter L. Bartlett. Implicit online learning. In Johannes Fürnkranz and Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML10), pages 575-582, June 2010. [ bib  .pdf ] 
[164]  Marius Kloft, Ulrich Rückert, and Peter L. Bartlett. A unifying view of multiple kernel learning. In José L. Balcázar, Francesco Bonchi, Aristides Gionis, and Michèle Sebag, editors, Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD, pages 66-81, September 2010. Part II, LNAI 6322. [ bib  DOI ] 
[165]  Peter L. Bartlett. Optimal online prediction in adversarial environments. In Marcus Hutter, Frank Stephan, Vladimir Vovk, and Thomas Zeugmann, editors, Algorithmic Learning Theory, 21st International Conference, ALT 2010, page 34, October 2010. (Plenary talk abstract). [ bib  DOI ] 
[166]  Jacob Abernethy, Peter L. Bartlett, Niv Buchbinder, and Isabelle Stanton. A regularization approach to metrical task systems. In Marcus Hutter, Frank Stephan, Vladimir Vovk, and Thomas Zeugmann, editors, Algorithmic Learning Theory, 21st International Conference, ALT 2010, pages 270-284, October 2010. [ bib  DOI ] 
[167] 
Sylvain Arlot and Peter L. Bartlett.
Margin-adaptive model selection in statistical learning.
Bernoulli, 17(2):687-713, May 2011.
[ bib 
.pdf ]

[168] 
Alekh Agarwal, John Duchi, Peter L. Bartlett, and Clement Levrard.
Oracle inequalities for computationally budgeted model selection.
In Sham Kakade and Ulrike von Luxburg, editors, Proceedings of
the Conference on Learning Theory (COLT2011), volume 19, pages 69-86, July
2011.
[ bib 
.pdf ]
We analyze general model selection procedures using penalized empirical loss minimization under computational constraints. While classical model selection approaches do not consider computational aspects of performing model selection, we argue that any practical model selection procedure must not only trade off estimation and approximation error, but also the effects of computational effort required to compute empirical minimizers for different function classes. We provide a framework for analyzing such problems, and we give algorithms for model selection under a computational budget.

[169] 
Jacob Abernethy, Peter L. Bartlett, and Elad Hazan.
Blackwell approachability and no-regret learning are equivalent.
In Sham Kakade and Ulrike von Luxburg, editors, Proceedings of
the Conference on Learning Theory (COLT2011), volume 19, pages 27-46, July
2011.
[ bib 
.pdf ]

[170] 
Afshin Rostamizadeh, Alekh Agarwal, and Peter L. Bartlett.
Learning with missing features.
In Avi Pfeffer and Fabio G. Cozman, editors, Proceedings of the
Conference on Uncertainty in Artificial Intelligence (UAI2011), pages
635-642, July 2011.
[ bib 
.pdf ]

[171]  John Shawe-Taylor, Richard Zemel, Peter L. Bartlett, Fernando Pereira, and Kilian Weinberger, editors. Advances in Neural Information Processing Systems 24. Proceedings of the 2011 Conference. NIPS Foundation, December 2011. [ bib  .html ] 
[172] 
John C. Duchi, Peter L. Bartlett, and Martin J. Wainwright.
Randomized smoothing for (parallel) stochastic optimization.
In 2012 IEEE 51st Annual Conference on Decision and Control
(CDC), IEEE Conference on Decision and Control, pages 5442-5444, 345
E 47th St, New York, NY 10017 USA, 2012. IEEE.
[ bib ]
By combining randomized smoothing techniques with accelerated gradient methods, we obtain convergence rates for stochastic optimization procedures, both in expectation and with high probability, that have optimal dependence on the variance of the gradient estimates. To the best of our knowledge, these are the first variance-based rates for nonsmooth optimization. A combination of our techniques with recent work on decentralized optimization yields order-optimal parallel stochastic optimization algorithms. We give applications of our results to several statistical machine learning problems, providing experimental results (in the full version of the paper) demonstrating the effectiveness of our algorithms.

[173] 
Alekh Agarwal, Peter L. Bartlett, and John Duchi.
Oracle inequalities for computationally adaptive model selection.
Technical Report 1208.0129, arxiv.org, 2012.
[ bib 
http ]
We analyze general model selection procedures using penalized empirical loss minimization under computational constraints. While classical model selection approaches do not consider computational aspects of performing model selection, we argue that any practical model selection procedure must not only trade off estimation and approximation error, but also the computational effort required to compute empirical minimizers for different function classes. We provide a framework for analyzing such problems, and we give algorithms for model selection under a computational budget. These algorithms satisfy oracle inequalities that show that the risk of the selected model is not much worse than if we had devoted all of our computational budget to the optimal function class.

[174] 
Fares Hedayati and Peter L. Bartlett.
Exchangeability characterizes optimality of sequential normalized
maximum likelihood and Bayesian prediction with Jeffreys prior.
In M. Girolami and N. Lawrence, editors, Proceedings of the
Fifteenth International Conference on Artificial Intelligence and Statistics
(AISTATS), volume 22, pages 504-510, April 2012.
[ bib 
.pdf ]
We study online prediction of individual sequences under logarithmic loss with parametric experts. The optimal strategy, normalized maximum likelihood (NML), is computationally demanding and requires the length of the game to be known. We consider two simpler strategies: sequential normalized maximum likelihood (SNML), which computes the NML forecasts at each round as if it were the last round, and Bayesian prediction. Under appropriate conditions, both are known to achieve near-optimal regret. In this paper, we investigate when these strategies are optimal. We show that SNML is optimal iff the joint distribution on sequences defined by SNML is exchangeable. In the case of exponential families, this is equivalent to the optimality of any Bayesian prediction strategy, and the optimal prior is Jeffreys prior.

[175] 
Alekh Agarwal, Peter Bartlett, Pradeep Ravikumar, and Martin Wainwright.
Information-theoretic lower bounds on the oracle complexity of
stochastic convex optimization.
IEEE Transactions on Information Theory, 58(5):3235-3249, May
2012.
[ bib 
DOI 
.pdf ]
Relative to the large literature on upper bounds on the complexity of convex optimization, less attention has been paid to the fundamental hardness of these problems. Given the extensive use of convex optimization in machine learning and statistics, gaining an understanding of these complexity-theoretic issues is important. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for various function classes.

[176] 
Fares Hedayati and Peter Bartlett.
The optimality of Jeffreys prior for online density estimation and
the asymptotic normality of maximum likelihood estimators.
In Proceedings of the Conference on Learning Theory (COLT2012),
volume 23, pages 7.1-7.13, June 2012.
[ bib 
.pdf ]
We study online learning under logarithmic loss with regular parametric models. We show that a Bayesian strategy predicts optimally only if it uses Jeffreys prior. This result was known for canonical exponential families; we extend it to parametric models for which the maximum likelihood estimator is asymptotically normal. The optimal prediction strategy, normalized maximum likelihood, depends on the number n of rounds of the game, in general. However, when a Bayesian strategy is optimal, normalized maximum likelihood becomes independent of n. Our proof uses this to exploit the asymptotics of normalized maximum likelihood. The asymptotic normality of the maximum likelihood estimator is responsible for the necessity of Jeffreys prior.

[177] 
John Duchi, Peter L. Bartlett, and Martin J. Wainwright.
Randomized smoothing for stochastic optimization.
SIAM Journal on Optimization, 22(2):674-701, June 2012.
[ bib 
.pdf ]
We analyze convergence rates of stochastic optimization algorithms for nonsmooth convex optimization problems. By combining randomized smoothing techniques with accelerated gradient methods, we obtain convergence rates of stochastic optimization procedures, both in expectation and with high probability, that have optimal dependence on the variance of the gradient estimates. To the best of our knowledge, these are the first variance-based rates for nonsmooth optimization. We give several applications of our results to statistical estimation problems and provide experimental results that demonstrate the effectiveness of the proposed algorithms. We also describe how a combination of our algorithm with recent work on decentralized optimization yields a distributed stochastic optimization algorithm that is order-optimal.

[178] 
A. Barth, Benjamin I. P. Rubinstein, M. Sundararajan, J. C. Mitchell, Dawn
Song, and Peter L. Bartlett.
A learning-based approach to reactive security.
IEEE Transactions on Dependable and Secure Computing,
9(4):482-493, July 2012.
[ bib 
http 
.pdf ]
Despite the conventional wisdom that proactive security is superior to reactive security, we show that reactive security can be competitive with proactive security as long as the reactive defender learns from past attacks instead of myopically overreacting to the last attack. Our game-theoretic model follows common practice in the security literature by making worst-case assumptions about the attacker: we grant the attacker complete knowledge of the defender’s strategy and do not require the attacker to act rationally. In this model, we bound the competitive ratio between a reactive defense algorithm (which is inspired by online learning theory) and the best fixed proactive defense. Additionally, we show that, unlike proactive defenses, this reactive strategy is robust to a lack of information about the attacker’s incentives and knowledge.

[179]  Massieh Najafi, David M. Auslander, Peter L. Bartlett, Philip Haves, and Michael D. Sohn. Application of machine learning in the fault diagnostics of air handling units. Applied Energy, 96:347-358, August 2012. [ bib  DOI ] 
[180] 
Benjamin I. P. Rubinstein, Peter L. Bartlett, Ling Huang, and Nina Taft.
Learning in a large function space: Privacy preserving mechanisms for
SVM learning.
Journal of Privacy and Confidentiality, 4(1):65-100, August
2012.
[ bib 
http ]

[181] 
Peter L. Bartlett, Shahar Mendelson, and Joseph Neeman.
ℓ_1-regularized linear regression: Persistence and oracle
inequalities.
Probability Theory and Related Fields, 154(1-2):193-224,
October 2012.
[ bib 
DOI 
.pdf ]
We study the predictive performance of ℓ_1-regularized linear regression in a model-free setting, including the case where the number of covariates is substantially larger than the sample size. We introduce a new analysis method that avoids the boundedness problems that typically arise in model-free empirical minimization. Our technique provides an answer to a conjecture of Greenshtein and Ritov [?] regarding the “persistence” rate for linear regression and allows us to prove an oracle inequality for the error of the regularized minimizer. It also demonstrates that empirical risk minimization gives optimal rates (up to log factors) of convex aggregation of a set of estimators of a regression function.

[182]  Peter L. Bartlett, Fernando Pereira, Chris J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, editors. Advances in Neural Information Processing Systems 25. Proceedings of the 2012 Conference. NIPS Foundation, December 2012. [ bib  .html ] 
[183] 
Yasin Abbasi-Yadkori, Peter L. Bartlett, Varun Kanade, Yevgeny Seldin, and
Csaba Szepesvari.
Online learning in Markov decision processes with adversarially
chosen transition probability distributions.
In Advances in Neural Information Processing Systems 26, pages
2508-2516, 2013.
[ bib 
http 
.pdf ]
We study the problem of online learning Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves O(sqrt(T log|Π|) + log|Π|) regret with respect to a comparison set of policies Π. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set Π has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem. Here, in each episode an adversary may choose a weighted directed acyclic graph with an identified start and finish node. The goal of the learning algorithm is to choose a path that minimizes the loss while traversing from the start to finish node. At the end of each episode the loss function (given by weights on the edges) is revealed to the learning algorithm. The goal is to minimize regret with respect to a fixed policy for selecting paths. This problem is a special case of the online MDP problem. For randomly chosen graphs and adversarial losses, this problem can be efficiently solved. We show that it also can be efficiently solved for adversarial graphs and randomly chosen losses. When both graphs and losses are adversarially chosen, we present an efficient algorithm whose regret scales linearly with the number of distinct graphs. Finally, we show that designing efficient algorithms for the adversarial online shortest path problem (and hence for the adversarial MDP problem) is as hard as learning parity with noise, a notoriously difficult problem that has been used to design efficient cryptographic schemes.

[184] 
Jacob Abernethy, Peter L. Bartlett, Rafael Frongillo, and Andre Wibisono.
How to hedge an option against an adversary: Black-Scholes
pricing is minimax optimal.
In Advances in Neural Information Processing Systems 26, pages
2346-2354, 2013.
[ bib 
http 
.pdf ]
We consider a popular problem in finance, option pricing, through the lens of an online learning game between Nature and an Investor. In the Black-Scholes option pricing model from 1973, the Investor can continuously hedge the risk of an option by trading the underlying asset, assuming that the asset's price fluctuates according to Geometric Brownian Motion (GBM). We consider a worst-case model, in which Nature chooses a sequence of price fluctuations under a cumulative quadratic volatility constraint, and the Investor can make a sequence of hedging decisions. Our main result is to show that the value of our proposed game, which is the “regret” of hedging strategy, converges to the Black-Scholes option price. We use significantly weaker assumptions than previous work (for instance, we allow large jumps in the asset price) and show that the Black-Scholes hedging strategy is near-optimal for the Investor even in this non-stochastic framework.

[185]  Yevgeny Seldin, Koby Crammer, and Peter L Bartlett. Open problem: Adversarial multi-armed bandits with limited advice. In Proceedings of the Conference on Learning Theory (COLT2013), volume 30, pages 1067-1072, 2013. [ bib  .pdf ] 
[186] 
Peter L. Bartlett, Peter Grunwald, Peter Harremoes, Fares Hedayati, and
Wojciech Kotlowski.
Horizon-independent optimal prediction with log-loss in exponential
families.
In Proceedings of the Conference on Learning Theory (COLT2013),
volume 30, pages 639-661, 2013.
[ bib 
.pdf ]
We study online learning under logarithmic loss with regular parametric models. Hedayati and Bartlett (2012) showed that a Bayesian prediction strategy with Jeffreys prior and sequential normalized maximum likelihood (SNML) coincide and are optimal if and only if the latter is exchangeable, which occurs if and only if the optimal strategy can be calculated without knowing the time horizon in advance. They put forward the question of which families have exchangeable SNML strategies. We answer this question for one-dimensional exponential families: SNML is exchangeable only for three classes of natural exponential family distributions, namely the Gaussian, the gamma, and the Tweedie exponential family of order 3/2.

[187] 
Alex Kantchelian, Michael C Tschantz, Ling Huang, Peter L Bartlett, Anthony D
Joseph, and J. Doug Tygar.
Large-margin convex polytope machine.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q.
Weinberger, editors, Advances in Neural Information Processing Systems
27, pages 3248-3256. Curran Associates, Inc., 2014.
[ bib 
.pdf ]
We present the Convex Polytope Machine (CPM), a novel non-linear learning algorithm for large-scale binary classification tasks. The CPM finds a large margin convex polytope separator which encloses one class. We develop a stochastic gradient descent based algorithm that is amenable to massive datasets, and augment it with a heuristic procedure to avoid sub-optimal local minima. Our experimental evaluations of the CPM on large-scale datasets from distinct domains (MNIST handwritten digit recognition, text topic, and web security) demonstrate that the CPM trains models faster, sometimes by several orders of magnitude, than state-of-the-art similar approaches and kernel-SVM methods while achieving comparable or better classification performance. Our empirical results suggest that, unlike prior similar approaches, we do not need to control the number of sub-classifiers (sides of the polytope) to avoid overfitting.

[188] 
Wouter M Koolen, Alan Malek, and Peter L Bartlett.
Efficient minimax strategies for square loss games.
In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q.
Weinberger, editors, Advances in Neural Information Processing Systems
27, pages 3230-3238. Curran Associates, Inc., 2014.
[ bib 
.pdf ]
We consider online prediction problems where the loss between the prediction and the outcome is measured by the squared Euclidean distance and its generalization, the squared Mahalanobis distance. We derive the minimax solutions for the case where the prediction and action spaces are the simplex (this setup is sometimes called the Brier game) and the ℓ_2 ball (this setup is related to Gaussian density estimation). We show that in both cases the value of each sub-game is a quadratic function of a simple statistic of the state, with coefficients that can be efficiently computed using an explicit recurrence relation. The resulting deterministic minimax strategy and randomized maximin strategy are linear functions of the statistic.

[189] 
Yasin Abbasi-Yadkori, Peter L. Bartlett, and Alan Malek.
Linear programming for large-scale Markov decision problems.
In Proceedings of the 31st International Conference on Machine
Learning (ICML14), pages 496-504, 2014.
[ bib 
.html 
.pdf ]
We consider the problem of controlling a Markov decision process (MDP) with a large state space, so as to minimize average cost. Since it is intractable to compete with the optimal policy for large scale problems, we pursue the more modest goal of competing with a low-dimensional family of policies. We use the dual linear programming formulation of the MDP average cost problem, in which the variable is a stationary distribution over state-action pairs, and we consider a neighborhood of a low-dimensional subset of the set of stationary distributions (defined in terms of state-action features) as the comparison class. We propose two techniques, one based on stochastic convex optimization, and one based on constraint sampling. In both cases, we give bounds that show that the performance of our algorithms approaches the best achievable by any policy in the comparison class. Most importantly, these results depend on the size of the comparison class, but not on the size of the state space. Preliminary experiments show the effectiveness of the proposed algorithms in a queuing application.

[190] 
Yasin Abbasi-Yadkori, Peter L. Bartlett, and Alan Malek.
Linear programming for large-scale Markov decision problems.
Technical Report 1402.6763, arXiv.org, 2014.
[ bib 
http ]
We consider the problem of controlling a Markov decision process (MDP) with a large state space, so as to minimize average cost. Since it is intractable to compete with the optimal policy for large scale problems, we pursue the more modest goal of competing with a low-dimensional family of policies. We use the dual linear programming formulation of the MDP average cost problem, in which the variable is a stationary distribution over state-action pairs, and we consider a neighborhood of a low-dimensional subset of the set of stationary distributions (defined in terms of state-action features) as the comparison class. We propose two techniques, one based on stochastic convex optimization, and one based on constraint sampling. In both cases, we give bounds that show that the performance of our algorithms approaches the best achievable by any policy in the comparison class. Most importantly, these results depend on the size of the comparison class, but not on the size of the state space. Preliminary experiments show the effectiveness of the proposed algorithms in a queuing application.

[191] 
J. Hyam Rubinstein, Benjamin Rubinstein, and Peter Bartlett.
Bounding embeddings of VC classes into maximum classes.
In A. Gammerman and V. Vovk, editors, Festschrift of Alexey
Chervonenkis. Springer, 2014.
[ bib 
http ]
One of the earliest conjectures in computational learning theory, the Sample Compression Conjecture, asserts that concept classes (or set systems) admit compression schemes of size polynomial in their VC dimension. To date this statement is known to be true for maximum classes: those that meet Sauer's Lemma, which bounds class cardinality in terms of VC dimension, with equality. The most promising approach to positively resolving the conjecture is by embedding general VC classes into maximum classes without super-linear increase to their VC dimensions, as such embeddings extend the known compression schemes to all VC classes. We show that maximum classes can be characterized by a local-connectivity property of the graph obtained by viewing the class as a cubical complex. This geometric characterization of maximum VC classes is applied to prove a negative embedding result which demonstrates VC-d classes that cannot be embedded in any maximum class of VC dimension lower than 2d. On the other hand, we give a general recursive procedure for embedding VC-d classes into VC-(d+k) maximum classes for smallest k.

[192] 
J. Hyam Rubinstein, Benjamin Rubinstein, and Peter Bartlett.
Bounding embeddings of VC classes into maximum classes.
Technical Report 1401.7388, arXiv.org, 2014.
[ bib 
http ]
One of the earliest conjectures in computational learning theory, the Sample Compression Conjecture, asserts that concept classes (or set systems) admit compression schemes of size polynomial in their VC dimension. To date this statement is known to be true for maximum classes: those that meet Sauer's Lemma, which bounds class cardinality in terms of VC dimension, with equality. The most promising approach to positively resolving the conjecture is by embedding general VC classes into maximum classes without super-linear increase to their VC dimensions, as such embeddings extend the known compression schemes to all VC classes. We show that maximum classes can be characterized by a local-connectivity property of the graph obtained by viewing the class as a cubical complex. This geometric characterization of maximum VC classes is applied to prove a negative embedding result which demonstrates VC-d classes that cannot be embedded in any maximum class of VC dimension lower than 2d. On the other hand, we give a general recursive procedure for embedding VC-d classes into VC-(d+k) maximum classes for smallest k.

[193] 
Yasin Abbasi-Yadkori, Peter L. Bartlett, and Varun Kanade.
Tracking adversarial targets.
In Proceedings of the 31st International Conference on Machine
Learning (ICML14), pages 369-377, 2014.
[ bib 
.html 
.pdf ]
We study linear quadratic problems with adversarial tracking targets. We propose a Follow The Leader algorithm and show that, under a stability condition, its regret grows as the logarithm of the number of rounds of the game. We also study a problem with adversarially chosen transition dynamics, for which an exponentially-weighted average algorithm is proposed and analyzed.

[194] 
Yevgeny Seldin, Peter L. Bartlett, Koby Crammer, and Yasin Abbasi-Yadkori.
Prediction with limited advice and multi-armed bandits with paid
observations.
In Proceedings of the 31st International Conference on Machine
Learning (ICML14), pages 280-287, 2014.
[ bib 
.html 
.pdf ]
We study two basic questions in online learning. The first question is what happens between full-information and limited-feedback games and the second question is the cost of information acquisition in online learning. The questions are addressed by defining two variations of standard online learning games. In the first variation, prediction with limited advice, we consider a game of prediction with expert advice, where on each round of the game we query the advice of a subset of M out of N experts. We present an algorithm that achieves O(sqrt((N/M) T ln N)) regret on T rounds of this game. The second variation, the multi-armed bandit with paid observations, is a variant of the adversarial N-armed bandit game, where on round t of the game, we can observe the reward of any number of arms, but each observation has a cost c. We present an algorithm that achieves O((c N ln N)^(1/3) T^(2/3)) regret on T rounds of this game. We present lower bounds that show that, apart from the logarithmic factors, these regret bounds cannot be improved.

[195]  Ambuj Tewari and Peter L. Bartlett. Learning theory. In Paulo S.R. Diniz, Johan A.K. Suykens, Rama Chellappa, and Sergios Theodoridis, editors, Signal Processing Theory and Machine Learning, volume 1 of Academic Press Library in Signal Processing, pages 775-816. Elsevier, 2014. [ bib ] 
[196] 
Peter L. Bartlett.
Online prediction.
Technical report, UC Berkeley EECS, 2015.
[ bib 
.pdf ]
We review game-theoretic models of prediction, in which the process generating the data is modelled as an adversary with whom the prediction method competes. We present a formulation that encompasses a wide variety of decision problems, and focus on the relationship between prediction in this game-theoretic setting and prediction in the more standard probabilistic setting. In particular, we present a view of standard prediction strategies as Bayesian decision methods, and we show how the regret of optimal strategies depends on complexity measures that are closely related to those that appear in probabilistic settings.

[197] 
Fares Hedayati and Peter L. Bartlett.
Exchangeability characterizes optimality of sequential normalized
maximum likelihood and Bayesian prediction.
Technical report, UC Berkeley EECS, 2015.
[ bib 
.pdf 
.pdf ]
We study online learning under logarithmic loss with regular parametric models. In this setting, each strategy corresponds to a joint distribution on sequences. The minimax optimal strategy is the normalized maximum likelihood (NML) strategy. We show that the sequential normalized maximum likelihood (SNML) strategy predicts minimax optimally (i.e. as NML) if and only if the joint distribution on sequences defined by SNML is exchangeable. This property also characterizes the optimality of a Bayesian prediction strategy. In that case, the optimal prior distribution is Jeffreys prior for a broad class of parametric models for which the maximum likelihood estimator is asymptotically normal. The optimal prediction strategy, normalized maximum likelihood, depends on the number n of rounds of the game, in general. However, when a Bayesian strategy is optimal, normalized maximum likelihood becomes independent of n. Our proof uses this to exploit the asymptotics of normalized maximum likelihood. The asymptotic normality of the maximum likelihood estimator is responsible for the necessity of Jeffreys prior.

[198] 
Yasin Abbasi-Yadkori, Peter L. Bartlett, and Stephen Wright.
A Lagrangian relaxation approach to Markov decision problems.
Technical report, UC Berkeley EECS, 2015.
[ bib ]
We study Markov decision problems (MDPs) over a restricted policy class, and show that a Lagrangian relaxation approach finds near-optimal policies in this class efficiently. In particular, the computational complexity depends on the number of features used to define policies, and not on the size of the state space. The statistical complexity also scales well: our method requires only low-dimensional second-order statistics. Most value-function-based methods for MDPs return a policy that is greedy with respect to the value function estimate. We discuss drawbacks of this approach, and propose a new policy class defined for some parameter vector w by π_w(a|x) = (1 − Q_w(x,a) + E_{ν(·|x)} Q_w) ν(a|x), where Q_w is the state-action value function, ν is a baseline policy, and the mean of Q_w under ν(·|x) acts as a normalizer. Similar to the greedy and Gibbs policies, the proposed policy assigns larger probabilities to actions with smaller value-function estimates. We demonstrate the effectiveness of our Lagrangian relaxation approach, applied to this policy class, on a queueing problem and an energy storage application.

[199] 
Walid Krichene, Alexandre Bayen, and Peter L. Bartlett.
Accelerating mirror descent in continuous and discrete time.
In C. Cortes, N.D. Lawrence, D.D. Lee, M. Sugiyama, and R. Garnett,
editors, Advances in Neural Information Processing Systems
28, pages 2827-2835. Curran Associates, Inc., 2015.
[ bib 
.pdf ]
We study accelerated mirror descent dynamics in continuous and discrete time. Combining the original continuous-time motivation of mirror descent with a recent ODE interpretation of Nesterov's accelerated method, we propose a family of continuous-time descent dynamics for convex functions with Lipschitz gradients, such that the solution trajectories are guaranteed to converge to the optimum at a O(1/t^2) rate. We then show that a large family of first-order accelerated methods can be obtained as a discretization of the ODE, and these methods converge at a O(1/k^2) rate. This connection between accelerated mirror descent and the ODE provides an intuitive approach to the design and analysis of accelerated first-order algorithms.

[200]  Walid Krichene, Alexandre Bayen, and Peter L. Bartlett. Accelerating mirror descent in continuous and discrete time. Technical report, EECS Department, University of California, Berkeley, 2015. [ bib ] 
[201]  Yasin Abbasi-Yadkori, Wouter Koolen, Alan Malek, and Peter L. Bartlett. Minimax time series prediction. Technical report, EECS Department, University of California, Berkeley, 2015. [ bib ] 
[202] 
Wouter Koolen, Alan Malek, Peter L. Bartlett, and Yasin Abbasi-Yadkori.
Minimax time series prediction.
In C. Cortes, N.D. Lawrence, D.D. Lee, M. Sugiyama, and R. Garnett,
editors, Advances in Neural Information Processing Systems
28, pages 2548-2556. Curran Associates, Inc., 2015.
[ bib 
.pdf ]
We consider an adversarial formulation of the problem of predicting a time series with square loss. The aim is to predict an arbitrary sequence of vectors almost as well as the best smooth comparator sequence in retrospect. Our approach allows natural measures of smoothness, such as the squared norm of increments. More generally, we can consider a linear time series model and penalize the comparator sequence through the energy of the implied driving noise terms. We derive the minimax strategy for all problems of this type, and we show that it can be implemented efficiently. The optimal predictions are linear in the previous observations. We obtain an explicit expression for the regret in terms of the parameters defining the problem. For typical, simple definitions of smoothness, the computation of the optimal predictions involves only sparse matrices. In the case of norm-constrained data, where the smoothness is defined in terms of the squared norm of the comparator's increments, we show that the regret grows as T/sqrt(λ), where T is the length of the game and λ specifies the smoothness of the comparator.

[203] 
Peter L. Bartlett, Wouter Koolen, Alan Malek, Eiji Takimoto, and Manfred
Warmuth.
Minimax fixed-design linear regression.
In Proceedings of the Conference on Learning Theory (COLT2015),
volume 40, pages 226-239, June 2015.
[ bib 
.pdf 
.pdf ]
We consider a linear regression game in which the covariates are known in advance: at each round, the learner predicts a real value, the adversary reveals a label, and the learner incurs a squared error loss. The aim is to minimize the regret with respect to linear predictions. For a variety of constraints on the adversary's labels, we show that the minimax optimal strategy is linear, with a parameter choice that is reminiscent of ordinary least squares (and as easy to compute). The predictions depend on all covariates, past and future, with a particular weighting assigned to future covariates corresponding to the role that they play in the minimax regret. We study two families of label sequences: box constraints (under a covariate compatibility condition), and a weighted 2-norm constraint that emerges naturally from the analysis. The strategy is adaptive in the sense that it requires no knowledge of the constraint set. We obtain an explicit expression for the minimax regret for these games. For the case of uniform box constraints, we show that, with worst case covariate sequences, the regret is O(d log T), with no dependence on the scaling of the covariates.

[204] 
Yasin Abbasi-Yadkori, Peter L. Bartlett, Xi Chen, and Alan Malek.
Large-scale Markov decision problems with KL control cost.
In Proceedings of the 32nd International Conference on Machine
Learning (ICML-15), volume 37, pages 1053-1062, June 2015.
[ bib 
.html 
.pdf ]

[205] 
Walid Krichene, Alexandre Bayen, and Peter L. Bartlett.
Adaptive averaging in accelerated descent dynamics.
In Advances in Neural Information Processing Systems 29, pages
2991-2999, 2016.
[ bib 
http 
.pdf ]
We study accelerated descent dynamics for constrained convex optimization. This dynamics can be described naturally as a coupling of a dual variable accumulating gradients at a given rate η(t), and a primal variable obtained as the weighted average of the mirrored dual trajectory, with weights w(t). Using a Lyapunov argument, we give sufficient conditions on η and w to achieve a desired convergence rate. As an example, we show that the replicator dynamics (an example of mirror descent on the simplex) can be accelerated using a simple averaging scheme. We then propose an adaptive averaging heuristic which adaptively computes the weights to speed up the decrease of the Lyapunov function. We provide guarantees on adaptive averaging in continuous-time, prove that it preserves the quadratic convergence rate of accelerated first-order methods in discrete-time, and give numerical experiments to compare it with existing heuristics, such as adaptive restarting. The experiments indicate that adaptive averaging performs at least as well as adaptive restarting, with significant improvements in some cases.

[206] 
Victor Gabillon, Alessandro Lazaric, Mohammad Ghavamzadeh, Ronald Ortner, and
Peter L. Bartlett.
Improved learning complexity in combinatorial pure exploration
bandits.
In Proceedings of AISTATS 2016, pages 1004-1012, 2016.
[ bib 
.html 
.pdf ]
We study the problem of combinatorial pure exploration in the stochastic multi-armed bandit problem. We first construct a new measure of complexity that provably characterizes the learning performance of the algorithms we propose for the fixed confidence and the fixed budget setting. We show that this complexity is never higher than the one in existing work and illustrate a number of configurations in which it can be significantly smaller. While in general this improvement comes at the cost of increased computational complexity, we provide a series of examples, including a planning problem, where this extra cost is not significant.

[207] 
Yasin Abbasi-Yadkori, Peter L. Bartlett, and Stephen Wright.
A fast and reliable policy improvement algorithm.
In Proceedings of AISTATS 2016, pages 1338-1346, 2016.
[ bib 
.html 
.pdf ]
We introduce a simple, efficient method that improves stochastic policies for Markov decision processes. The computational complexity is the same as that of the value estimation problem. We prove that when the value estimation error is small, this method gives an improvement in performance that increases with certain variance properties of the initial policy and transition dynamics. Performance in numerical experiments compares favorably with previous policy improvement algorithms.

[208]  Niladri Chatterji and Peter Bartlett. Alternating minimization for dictionary learning with random initialization. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1997-2006. Curran Associates, Inc., 2017. [ bib  .pdf  .pdf ] 
[209] 
Walid Krichene and Peter Bartlett.
Acceleration and averaging in stochastic descent dynamics.
In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information
Processing Systems 30, pages 6796-6806. Curran Associates, Inc., 2017.
[ bib 
.pdf 
.pdf ]
We formulate and study a general family of (continuous-time) stochastic dynamics for accelerated first-order minimization of smooth convex functions. Building on an averaging formulation of accelerated mirror descent, we propose a stochastic variant in which the gradient is contaminated by noise, and study the resulting stochastic differential equation. We prove a bound on the rate of change of an energy function associated with the problem, then use it to derive estimates of convergence rates of the function values (almost surely and in expectation), both for persistent and asymptotically vanishing noise. We discuss the interaction between the parameters of the dynamics (learning rate and averaging rates) and the covariation of the noise process. In particular, we show how the asymptotic rate of covariation affects the choice of parameters and, ultimately, the convergence rate.

[210] 
Peter L. Bartlett, Nick Harvey, Chris Liaw, and Abbas Mehrabian.
Nearly-tight VC-dimension and pseudodimension bounds for piecewise
linear neural networks.
Technical Report 1703.02930, arXiv.org, 2017.
[ bib 
http 
.pdf ]
We prove new upper and lower bounds on the VC-dimension of deep neural networks with the ReLU activation function. These bounds are tight for almost the entire range of parameters. Letting W be the number of weights and L be the number of layers, we prove that the VC-dimension is O(W L log(W)), and provide examples with VC-dimension Ω(W L log(W/L)). This improves both the previously known upper bounds and lower bounds. In terms of the number U of nonlinear units, we prove a tight bound Θ(W U) on the VC-dimension. All of these bounds generalize to arbitrary piecewise linear activation functions, and also hold for the pseudodimensions of these function classes. Combined with previous results, this gives an intriguing range of dependencies of the VC-dimension on depth for networks with different nonlinearities: there is no dependence for piecewise-constant, linear dependence for piecewise-linear, and no more than quadratic dependence for general piecewise-polynomial.

[211] 
Peter Bartlett, Dylan Foster, and Matus Telgarsky.
Spectrally-normalized margin bounds for neural networks.
In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information
Processing Systems 30, pages 6240-6249. Curran Associates, Inc., 2017.
[ bib 
.pdf 
.pdf ]
This paper presents a margin-based multiclass generalization bound for neural networks that scales with their margin-normalized spectral complexity: their Lipschitz constant, meaning the product of the spectral norms of the weight matrices, times a certain correction factor. This bound is empirically investigated for a standard AlexNet network trained with SGD on the mnist and cifar10 datasets, with both original and random labels; the bound, the Lipschitz constants, and the excess risks are all in direct correlation, suggesting both that SGD selects predictors whose complexity scales with the difficulty of the learning task, and secondly that the presented bound is sensitive to this complexity.

[212] 
Yasin Abbasi-Yadkori, Peter L. Bartlett, and Victor Gabillon.
Near minimax optimal players for the finite-time 3-expert prediction
problem.
In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information
Processing Systems 30, pages 3033-3042. Curran Associates, Inc., 2017.
[ bib 
.pdf 
.pdf ]
We study minimax strategies for the online prediction problem with expert advice. It is conjectured that a simple adversary strategy, called Comb, is optimal in this game for any number of experts, including the non-asymptotic case where the number of experts is small. We make progress in this direction by showing that Comb is minimax optimal within an additive logarithmic error in the finite-time three-expert problem.

[213] 
Martin Péron, Kai Helge Becker, Peter L. Bartlett, and Iadine Chadès.
Fast-tracking stationary MOMDPs for adaptive management problems.
In Proceedings of the Thirty-First AAAI Conference on Artificial
Intelligence (AAAI-17), pages 4531-4537, 2017.
[ bib 
http 
http ]
Adaptive management is applied in conservation and natural resource management, and consists of making sequential decisions when the transition matrix is uncertain. Informally described as 'learning by doing', this approach aims to trade off between decisions that help achieve the objective and decisions that will yield a better knowledge of the true transition matrix. When the true transition matrix is assumed to be an element of a finite set of possible matrices, solving a mixed observability Markov decision process (MOMDP) leads to an optimal trade-off but is very computationally demanding. Under the assumption (common in adaptive management) that the true transition matrix is stationary, we propose a polynomial-time algorithm to find a lower bound of the value function. In the corners of the domain of the value function (belief space), this lower bound is provably equal to the optimal value function. We also show that under further assumptions, it is a linear approximation of the optimal value function in a neighborhood around the corners. We evaluate the benefits of our approach by using it to initialize the solvers MO-SARSOP and Perseus on a novel computational sustainability problem and a recent adaptive management data challenge. Our approach leads to an improved initial value function and translates into significant computational gains for both solvers.

[214] 
Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon.
Recovery guarantees for one-hidden-layer neural networks.
In Doina Precup and Yee Whye Teh, editors, Proceedings of the
34th International Conference on Machine Learning (ICML-17), volume 70 of
Proceedings of Machine Learning Research, pages 4140-4149. PMLR, 2017.
[ bib 
.html 
.pdf ]
In this paper, we consider regression problems with one-hidden-layer neural networks (1NNs). We distill some properties of activation functions that lead to local strong convexity in the neighborhood of the ground-truth parameters for the 1NN squared-loss objective. Most popular nonlinear activation functions satisfy the distilled properties, including rectified linear units (ReLUs), leaky ReLUs, squared ReLUs and sigmoids. For activation functions that are also smooth, we show local linear convergence guarantees of gradient descent under a resampling rule. For homogeneous activations, we show tensor methods are able to initialize the parameters to fall into the local strong convexity region. As a result, tensor initialization followed by gradient descent is guaranteed to recover the ground truth with sample complexity d · log(1/ε) · poly(k,λ) and computational complexity n · d · poly(k,λ) for smooth homogeneous activations with high probability, where d is the dimension of the input, k (k ≤ d) is the number of hidden nodes, λ is a conditioning property of the ground-truth parameter matrix between the input layer and the hidden layer, ε is the targeted precision and n is the number of samples. To the best of our knowledge, this is the first work that provides recovery guarantees for 1NNs with both sample complexity and computational complexity linear in the input dimension and logarithmic in the precision.

[215] 
Yasin Abbasi-Yadkori, Alan Malek, Peter L. Bartlett, and Victor Gabillon.
Hit-and-run for sampling and planning in non-convex spaces.
In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th
International Conference on Artificial Intelligence and Statistics,
volume 54 of Proceedings of Machine Learning Research, pages 888-895,
Fort Lauderdale, FL, USA, 2017.
[ bib 
.pdf ]
We propose the Hit-and-Run algorithm for planning and sampling problems in non-convex spaces. For sampling, we show the first analysis of the Hit-and-Run algorithm in non-convex spaces and show that it mixes fast as long as certain smoothness conditions are satisfied. In particular, our analysis reveals an intriguing connection between fast mixing and the existence of smooth measure-preserving mappings from a convex space to the non-convex space. For planning, we show advantages of Hit-and-Run compared to state-of-the-art planning methods such as Rapidly-Exploring Random Trees.

[216] 
Fares Hedayati and Peter L. Bartlett.
Exchangeability characterizes optimality of sequential normalized
maximum likelihood and Bayesian prediction.
IEEE Transactions on Information Theory, 63(10):6767-6773,
October 2017.
[ bib 
DOI 
.pdf 
.pdf ]
We study online learning under logarithmic loss with regular parametric models. In this setting, each strategy corresponds to a joint distribution on sequences. The minimax optimal strategy is the normalized maximum likelihood (NML) strategy. We show that the sequential normalized maximum likelihood (SNML) strategy predicts minimax optimally (i.e. as NML) if and only if the joint distribution on sequences defined by SNML is exchangeable. This property also characterizes the optimality of a Bayesian prediction strategy. In that case, the optimal prior distribution is Jeffreys prior for a broad class of parametric models for which the maximum likelihood estimator is asymptotically normal. The optimal prediction strategy, normalized maximum likelihood, depends on the number n of rounds of the game, in general. However, when a Bayesian strategy is optimal, normalized maximum likelihood becomes independent of n. Our proof uses this to exploit the asymptotics of normalized maximum likelihood. The asymptotic normality of the maximum likelihood estimator is responsible for the necessity of Jeffreys prior.

[217] 
Yasin Abbasi-Yadkori, Peter L. Bartlett, Victor Gabillon, Alan Malek, and
Michal Valko.
Best of both worlds: Stochastic and adversarial best-arm
identification.
In Proceedings of the Conference on Learning Theory (COLT2018),
2018.
[ bib 
http 
.pdf ]
We study bandit best-arm identification with arbitrary and potentially adversarial rewards. A simple random uniform learner obtains the optimal rate of error in the adversarial scenario. However, this type of strategy is suboptimal when the rewards are sampled stochastically. Therefore, we ask: Can we design a learner that performs optimally in both the stochastic and adversarial problems while not being aware of the nature of the rewards? First, we show that designing such a learner is impossible in general. In particular, to be robust to adversarial rewards, we can only guarantee optimal rates of error on a subset of the stochastic problems. We give a lower bound that characterizes the optimal rate in stochastic problems if the strategy is constrained to be robust to adversarial rewards. Finally, we design a simple parameter-free algorithm and show that its probability of error matches (up to log factors) the lower bound in stochastic problems, and it is also robust to adversarial ones.

[218] 
Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan.
Underdamped Langevin MCMC: A non-asymptotic analysis.
In Proceedings of the Conference on Learning Theory (COLT2018),
2018.
[ bib 
.html 
.pdf ]
We study the underdamped Langevin diffusion when the log of the target distribution is smooth and strongly concave. We present an MCMC algorithm based on its discretization and show that it achieves ε error (in 2-Wasserstein distance) in O(sqrt(d)/ε) steps. This is a significant improvement over the best known rate for overdamped Langevin MCMC, which is O(d/ε^2) steps under the same smoothness/concavity assumptions. The underdamped Langevin MCMC scheme can be viewed as a version of Hamiltonian Monte Carlo (HMC) which has been observed to outperform overdamped Langevin MCMC methods in a number of application areas. We provide quantitative rates that support this empirical wisdom.

[219] 
Peter L. Bartlett, David P. Helmbold, and Philip M. Long.
Gradient descent with identity initialization efficiently learns
positive definite linear transformations by deep residual networks.
In Proceedings of the 35th International Conference on Machine
Learning (ICML-18), 2018.
[ bib 
http 
.pdf ]
We analyze algorithms for approximating a function f(x) = Φx mapping R^d to R^d using deep linear neural networks, i.e. that learn a function h parameterized by matrices Θ_1, ..., Θ_L and defined by h(x) = Θ_L Θ_{L−1} ... Θ_1 x. We focus on algorithms that learn through gradient descent on the population quadratic loss in the case that the distribution over the inputs is isotropic. We provide polynomial bounds on the number of iterations for gradient descent to approximate the least squares matrix Φ, in the case where the initial hypothesis Θ_1 = ... = Θ_L = I has excess loss bounded by a small enough constant. On the other hand, we show that gradient descent fails to converge for Φ whose distance from the identity is a larger constant, and we show that some forms of regularization toward the identity in each layer do not help. If Φ is symmetric positive definite, we show that an algorithm that initializes Θ_i = I learns an ε-approximation of f using a number of updates polynomial in L, the condition number of Φ, and log(d/ε). In contrast, we show that if the least squares matrix Φ is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge. We analyze an algorithm for the case that Φ satisfies u^T Φ u > 0 for all u, but may not be symmetric. This algorithm uses two regularizers: one that maintains the invariant u^T Θ_L Θ_{L−1} ... Θ_1 u > 0 for all u, and another that 'balances' Θ_1, ..., Θ_L so that they have the same singular values.

[220]  Niladri Chatterji, Nicolas Flammarion, Yian Ma, Peter Bartlett, and Michael Jordan. On the theory of variance reduction for stochastic gradient Monte Carlo. In Proceedings of the 35th International Conference on Machine Learning (ICML-18), 2018. [ bib  .html  .pdf ] 
[221] 
Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett.
Byzantine-robust distributed learning: Towards optimal statistical
rates.
In Proceedings of the 35th International Conference on Machine
Learning (ICML-18), 2018.
[ bib 
.html 
.pdf ]
We develop distributed optimization algorithms that are provably robust against Byzantine failures (arbitrary and potentially adversarial behavior) in distributed computing systems, with a focus on achieving optimal statistical performance. A main result of this work is a sharp analysis of two robust distributed gradient descent algorithms based on median and trimmed mean operations, respectively. We prove statistical error rates for all of strongly convex, non-strongly convex, and smooth non-convex population loss functions. In particular, these algorithms are shown to achieve order-optimal statistical error rates for strongly convex losses. To achieve better communication efficiency, we further propose a median-based distributed algorithm that is provably robust, and uses only one communication round. For strongly convex quadratic loss, we show that this algorithm achieves the same optimal error rate as the robust distributed gradient descent algorithms.

[222]  Martin Péron, Peter Bartlett, Kai Helge Becker, Kate Helmstedt, and Iadine Chadès. Two approximate dynamic programming algorithms for managing complete SIS networks. In ACM SIGCAS Conference on Computing and Sustainable Societies (COMPASS 2018), 2018. [ bib  http  .pdf ] 
[223]  Xiang Cheng and Peter Bartlett. Convergence of Langevin MCMC in KL-divergence. In Proceedings of ALT2018, 2018. [ bib  .html  .pdf ] 
[224]  Xiang Cheng, Fred Roosta, Stefan Palombo, Peter Bartlett, and Michael Mahoney. Flag n' flare: Fast linearly-coupled adaptive gradient methods. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, 2018. [ bib  .html  .pdf ] 
[225]  Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, and Peter Bartlett. Gradient diversity: a key ingredient for scalable distributed learning. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, 2018. [ bib  .html  .pdf ] 
[226] 
Peter L. Bartlett, Steven Evans, and Philip M. Long.
Representing smooth functions as compositions of near-identity
functions with implications for deep network optimization.
Technical Report 1804.05012, arXiv.org, 2018.
[ bib 
http 
.pdf ]
We show that any smooth bi-Lipschitz h can be represented exactly as a composition h_m ∘ ... ∘ h_1 of functions h_1,...,h_m that are close to the identity in the sense that each (h_i − Id) is Lipschitz, and the Lipschitz constant decreases inversely with the number m of functions composed. This implies that h can be represented to any accuracy by a deep residual network whose nonlinear layers compute functions with a small Lipschitz constant. Next, we consider nonlinear regression with a composition of near-identity nonlinear maps. We show that, regarding Fréchet derivatives with respect to the h_1,...,h_m, any critical point of a quadratic criterion in this near-identity region must be a global minimizer. In contrast, if we consider derivatives with respect to parameters of a fixed-size residual network with sigmoid activation functions, we show that there are near-identity critical points that are suboptimal, even in the realizable case. Informally, this means that functional gradient methods for residual networks cannot get stuck at suboptimal critical points corresponding to near-identity layers, whereas parametric gradient methods for sigmoidal residual networks suffer from suboptimal critical points in the near-identity region.
