Peter Bartlett's Publications (1993-2024)

[1]	P. L. Bartlett. Vapnik-Chervonenkis dimension bounds for two- and three-layer networks. Neural Computation, 5(3):371--373, 1993. [ bib ]
[2]	D. R. Lovell and P. L. Bartlett. Error and variance bounds in multi-layer neural networks. In Proceedings of the Fourth Australian Conference on Neural Networks, pages 161--164, 1993. [ bib ]
[3]	P. L. Bartlett. The sample size necessary for learning in multi-layer networks. In Proceedings of the Fourth Australian Conference on Neural Networks, pages 14--17, 1993. [ bib ]
[4]	P. L. Bartlett. Lower bounds on the Vapnik-Chervonenkis dimension of multi-layer threshold networks. In Proceedings of the Sixth Annual ACM Conference on Computational Learning Theory, pages 144--150. ACM Press, 1993. [ bib ]
[5]	P. L. Bartlett. Computational learning theory. In A. Kent and J. G. Williams, editors, Encyclopedia of Computer Science and Technology, volume 31, pages 83--99. Marcel Dekker, 1994. [ bib ]
[6]	P. L. Bartlett, P. Fischer, and K.-U. Höffgen. Exploiting random walks for learning. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, pages 318--327. ACM Press, 1994. [ bib ]
[7]	P. L. Bartlett, P. M. Long, and R. C. Williamson. Fat-shattering and the learnability of real-valued functions. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, pages 299--310. ACM Press, 1994. [ bib ]
[8]	W. S. Lee, P. L. Bartlett, and R. C. Williamson. Lower bounds on the VC-dimension of smoothly parametrized function classes. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, pages 362--367. ACM Press, 1994. [ bib ]
[9]	P. L. Bartlett. Learning quantized real-valued functions. In Proceedings of Computing: the Australian Theory Seminar, pages 24--35. University of Technology Sydney, 1994. [ bib ]
[10]	W. S. Lee, P. L. Bartlett, and R. C. Williamson. The Vapnik-Chervonenkis dimension of neural networks with restricted parameter ranges. In Proceedings of the Fifth Australian Conference on Neural Networks, pages 198--201, 1994. [ bib ]
[11]	W. S. Lee, P. L. Bartlett, and R. C. Williamson. Lower bounds on the VC-dimension of smoothly parametrized function classes. Neural Computation, 7:990--1002, 1995. (See also correction, Neural Computation, 9: 765--769, 1997). [ bib ]
[12]	P. L. Bartlett and S. Dasgupta. Exponential convergence of a gradient descent algorithm for a class of recurrent neural networks. In Proceedings of the 38th Midwest Symposium on Circuits and Systems, 1995. [ bib ]
[13]	P. L. Bartlett and P. M. Long. More theorems about scale sensitive dimensions and learning. In Proceedings of the Eighth Annual ACM Conference on Computational Learning Theory, pages 392--401. ACM Press, 1995. [ bib ]
[14]	W. S. Lee, P. L. Bartlett, and R. C. Williamson. On efficient agnostic learning of linear combinations of basis functions. In Proceedings of the Eighth Annual ACM Conference on Computational Learning Theory, pages 369--376. ACM Press, 1995. [ bib ]
[15]	M. Anthony and P. L. Bartlett. Function learning from interpolation. In Computational Learning Theory: Second European Conference, EUROCOLT 95, Barcelona Spain, March 1995, Proceedings, pages 211--221, 1995. [ bib ]
[16]	P. L. Bartlett and R. C. Williamson. The sample complexity of neural network learning with discrete inputs. In Proceedings of the Sixth Australian Conference on Neural Networks, pages 189--192, 1995. [ bib ]
[17]	W. S. Lee, P. L. Bartlett, and R. C. Williamson. Efficient agnostic learning of neural networks with bounded fan-in. In Proceedings of the Sixth Australian Conference on Neural Networks, pages 201--204, 1995. [ bib ]
[18]	Peter L. Bartlett, Anthony Burkitt, and Robert C. Williamson, editors. Proceedings of the Seventh Australian Conference on Neural Networks. Australian National University, 1996. [ bib ]
[19]	W. S. Lee, P. L. Bartlett, and R. C. Williamson. Efficient agnostic learning of neural networks with bounded fan-in. IEEE Transactions on Information Theory, 42(6):2118--2132, 1996. [ bib ]
[20]	M. Anthony, P. L. Bartlett, Y. Ishai, and J. Shawe-Taylor. Valid generalisation from approximate interpolation. Combinatorics, Probability, and Computing, 5:191--214, 1996. [ bib ]
[21]	P. L. Bartlett, P. M. Long, and R. C. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52(3):434--452, 1996. (special issue on COLT`94). [ bib ]
[22]	P. L. Bartlett and R. C. Williamson. The Vapnik-Chervonenkis dimension and pseudodimension of two-layer neural networks with discrete inputs. Neural Computation, 8:653--656, 1996. [ bib ]
[23]	A. Kowalczyk, J. Szymanski, P. L. Bartlett, and R. C. Williamson. Examples of learning curves from a modified VC-formalism. In Advances in Neural Information Processing Systems 8, pages 344--350, 1996. [ bib ]
[24]	P. L. Bartlett and S. R. Kulkarni. The complexity of model classes, and smoothing noisy data (invited). In Proceedings of the 35th IEEE Conference on Decision and Control, pages TM09--4, 2312--2317. IEEE, 1996. [ bib ]
[25]	L. Kammer, R. R. Bitmead, and P. L. Bartlett. Signal-based testing of LQ-optimality of controllers. In Proceedings of the 35th IEEE Conference on Decision and Control, pages FA17--2, 3620--3623. IEEE, 1996. [ bib ]
[26]	P. L. Bartlett, S. Ben-David, and S. R. Kulkarni. Learning changing concepts by exploiting the structure of change. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 131--139. ACM Press, 1996. [ bib ]
[27]	J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. A framework for structural risk minimization. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 68--76. ACM Press, 1996. [ bib ]
[28]	W. S. Lee, P. L. Bartlett, and R. C. Williamson. The importance of convexity in learning with squared loss. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 140--146. ACM Press, 1996. [ bib ]
[29]	L. C. Kammer, R. R. Bitmead, and P. L. Bartlett. Adaptive tracking identification: the art of defalsification. In Proceedings of the 1996 IFAC World Congress, 1996. [ bib ]
[30]	P. L. Bartlett, S. R. Kulkarni, and S. E. Posner. Covering numbers for real-valued function classes. IEEE Transactions on Information Theory, 43(5):1721--1724, 1997. [ bib ]
[31]	P. L. Bartlett. Book review: `Neural networks for pattern recognition,' Christopher M. Bishop. Statistics in Medicine, 16(20):2385--2386, 1997. [ bib ]
[32]	W. S. Lee, P. L. Bartlett, and R. C. Williamson. Correction to `lower bounds on the VC-dimension of smoothly parametrized function classes'. Neural Computation, 9:765--769, 1997. [ bib ]
[33]	P. L. Bartlett. For valid generalization, the size of the weights is more important than the size of the network. In Advances in Neural Information Processing Systems 9, pages 134--140, 1997. [ bib ]
[34]	R. E. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In Machine Learning: Proceedings of the Fourteenth International Conference, pages 322--330, 1997. [ bib ]
[35]	P. L. Bartlett, T. Linder, and G. Lugosi. The minimax distortion redundancy in empirical quantizer design (abstract). In Proceedings of the 1997 IEEE International Symposium on Information Theory, page 511, 1997. [ bib ]
[36]	P. L. Bartlett, T. Linder, and G. Lugosi. A minimax lower bound for empirical quantizer design. In S. Ben-David, editor, Proceedings of the Third European Conference on Computational Learning Theory (EuroCOLT'97), pages 220--222. Springer, 1997. [ bib ]
[37]	J. Baxter and P. L. Bartlett. A result relating convex n-widths to covering numbers with some applications to neural networks. In S. Ben-David, editor, Proceedings of the Third European Conference on Computational Learning Theory (EuroCOLT'97), pages 251--259. Springer, 1997. [ bib ]
[38]	P. L. Bartlett. Neural network learning. (abstract of invited talk.). In CONTROL 97 Conference Proceedings, Institution of Engineers Australia, page 543, 1997. [ bib ]
[39]	G. Loy and P. L. Bartlett. Generalization and the size of the weights: an experimental study. In Proceedings of the Eighth Australian Conference on Neural Networks, pages 60--64, 1997. [ bib ]
[40]	Peter L. Bartlett and Yishay Mansour, editors. Proceedings of the Eleventh Annual Conference on Computational Learning Theory. ACM Press, 1998. [ bib ]
[41]	R. E. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5):1651--1686, 1998. [ bib ]
[42]	P. L. Bartlett, V. Maiorov, and R. Meir. Almost linear VC dimension bounds for piecewise polynomial networks. Neural Computation, 10(8):2159--2173, 1998. [ bib ]
[43]	W. S. Lee, P. L. Bartlett, and R. C. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974--1980, 1998. [ bib ]
[44]	J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926--1940, 1998. [ bib ]
[45]	P. L. Bartlett, T. Linder, and G. Lugosi. The minimax distortion redundancy in empirical quantizer design. IEEE Transactions on Information Theory, 44(5):1802--1813, 1998. [ bib ]
[46]	P. L. Bartlett and S. Kulkarni. The complexity of model classes, and smoothing noisy data. Systems and Control Letters, 34(3):133--140, 1998. [ bib ]
[47]	P. L. Bartlett and M. Vidyasagar. Introduction to the special issue on learning theory. Systems and Control Letters, 34:113--114, 1998. [ bib ]
[48]	L. C. Kammer, R. R. Bitmead, and P. L. Bartlett. Optimal controller properties from closed-loop experiments. Automatica, 34(1):83--91, 1998. [ bib ]
[49]	P. L. Bartlett and P. M. Long. Prediction, learning, uniform convergence, and scale-sensitive dimensions. Journal of Computer and System Sciences, 56(2):174--190, 1998. (special issue on COLT`95). [ bib ]
[50]	P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525--536, 1998. [ bib ]
[51]	J. Baxter and P. L. Bartlett. The canonical distortion measure in feature space and 1-NN classification. In Advances in Neural Information Processing Systems 10, pages 245--251, 1998. [ bib ]
[52]	M. Golea, P. L. Bartlett, and W. S. Lee. Generalization in decision trees and DNF: Does size matter? In Advances in Neural Information Processing Systems 10, pages 259--265, 1998. [ bib ]
[53]	L. C. Kammer, R. R. Bitmead, and P. L. Bartlett. Direct iterative tuning via spectral analysis. In Proceedings of the IEEE Conference on Decision and Control, volume 3, pages 2874--2879, 1998. [ bib ]
[54]	B. Schölkopf, P. L. Bartlett, A. Smola, and R. Williamson. Support vector regression with automatic accuracy control. In L. Niklasson, M. Boden, and T. Ziemke, editors, Perspectives in Neural Computing: Proceedings of the 8th International Conference on Artificial Neural Networks (ICANN'98), pages 111--116. Springer-Verlag, 1998. [ bib ]
[55]	L. Mason, P. L. Bartlett, and M. Golea. Generalization in threshold networks, combined decision trees and combined mask perceptrons. In T. Downs, M. Frean, and M. Gallagher, editors, Proceedings of the Ninth Australian Conference on Neural Networks (ACNN'98), pages 84--88. University of Queensland, 1998. [ bib ]
[56]	Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999. 404pp. ISBN 978-0-521-57353-X. Reprinted 2001, 2002. Paperback edition 2009; ISBN 978-0-521-11862-0. [ bib \| .html ]
[57]	P. L. Bartlett and G. Lugosi. An inequality for uniform deviations of sample averages from their means. Statistics and Probability Letters, 44(1):55--62, 1999. [ bib ]
[58]	P. L. Bartlett. Efficient neural network learning. In V. D. Blondel, E. D. Sontag, M. Vidyasagar, and J. C. Willems, editors, Open Problems in Mathematical Systems Theory and Control, pages 35--38. Springer Verlag, 1999. [ bib ]
[59]	P. L. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods -- Support Vector Learning, pages 43--54. MIT Press, 1999. [ bib ]
[60]	P. L. Bartlett, V. Maiorov, and R. Meir. Almost linear VC dimension bounds for piecewise polynomial networks. In Advances in Neural Information Processing Systems 11, pages 190--196, 1999. [ bib ]
[61]	L. Mason, P. L. Bartlett, and J. Baxter. Direct optimization of margins improves generalization in combined classifiers. In Advances in Neural Information Processing Systems 11, pages 288--294, 1999. [ bib ]
[62]	B. Schölkopf, P. L. Bartlett, A. Smola, and R. Williamson. Shrinking the tube: a new support vector regression algorithm. In Advances in Neural Information Processing Systems 11, pages 330--336, 1999. [ bib ]
[63]	T. Koshizen, P. L. Bartlett, and A. Zelinsky. Sensor fusion of odometry and sonar sensors by the Gaussian mixture Bayes' technique in mobile robot position estimation. In Proceedings of the 1999 IEEE International Conference on Systems, Man and Cybernetics, volume 4, pages 742--747, 1999. [ bib ]
[64]	Y. Guo, P. L. Bartlett, J. Shawe-Taylor, and R. C. Williamson. Covering numbers for support vector machines. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 267--277, 1999. [ bib ]
[65]	P. L. Bartlett and S. Ben-David. Hardness results for neural network approximation problems. In Proceedings of the Fourth European Conference on Computational Learning Theory, pages 50--62, 1999. [ bib ]
[66]	L. Mason, P. L. Bartlett, and J. Baxter. Error bounds for voting classifiers using margin cost functions (invited abstract). In Proceedings of the IEEE Information Theory Workshop on Detection, Estimation, Classification and Imaging, page 36, 1999. [ bib ]
[67]	P. L. Bartlett and J. Baxter. Voting methods for data segmentation. In Proceedings of the Advanced Investment Technology Conference, pages 35--40. Bond University, 1999. [ bib ]
[68]	Alexander J. Smola, Peter L. Bartlett, Bernard Schölkopf, and Dale Schuurmans, editors. Advances in Large Margin Classifiers. MIT Press, 2000. [ bib ]
[69]	M. Anthony and P. L. Bartlett. Function learning from interpolation. Combinatorics, Probability, and Computing, 9:213--225, 2000. [ bib ]
[70]	L. Mason, J. Baxter, P. L. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 221--246. MIT Press, 2000. [ bib ]
[71]	A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans. Introduction to large margin classifiers. In Advances in Large Margin Classifiers, pages 1--29. MIT Press, 2000. [ bib ]
[72]	P. L. Bartlett, S. Ben-David, and S. R. Kulkarni. Learning changing concepts by exploiting the structure of change. Machine Learning, 41(2):153--174, 2000. [ bib ]
[73]	S. Parameswaran, M. F. Parkinson, and P. L. Bartlett. Profiling in the ASP codesign environment. Journal of Systems Architecture, 46(14):1263--1274, 2000. [ bib ]
[74]	B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207--1245, 2000. [ bib ]
[75]	L. C. Kammer, R. R. Bitmead, and P. L. Bartlett. Direct iterative tuning via spectral analysis. Automatica, 36(9):1301--1307, 2000. [ bib ]
[76]	L. Mason, P. L. Bartlett, and J. Baxter. Improved generalization through explicit optimization of margins. Machine Learning, 38(3):243--255, 2000. [ bib ]
[77]	P. L. Bartlett and J. Baxter. Stochastic optimization of controlled partially observable Markov decision processes. In Proceedings of the IEEE Conference on Decision and Control, volume 1, pages 124--129, 2000. [ bib ]
[78]	J. Baxter and P. L. Bartlett. Direct gradient-based reinforcement learning (invited). In Proceedings of the International Symposium on Circuits and Systems, pages III--271--274, 2000. [ bib ]
[79]	L. Mason, J. Baxter, P. L. Bartlett, and M. Frean. Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems 12, pages 512--518, 2000. [ bib ]
[80]	J. Baxter and P. L. Bartlett. GPOMDP: An on-line algorithm for estimating performance gradients in POMDP's, with applications. In Proceedings of the 2000 International Conference on Machine Learning, pages 41--48, 2000. [ bib ]
[81]	P. L. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 286--297, 2000. [ bib ]
[82]	P. L. Bartlett and J. Baxter. Estimation and approximation bounds for gradient-based reinforcement learning. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 133--141, 2000. [ bib ]
[83]	J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319--350, 2001. [ bib \| .html ]
[84]	J. Baxter, P. L. Bartlett, and L. Weaver. Experiments with infinite-horizon, policy-gradient estimation. Journal of Artificial Intelligence Research, 15:351--381, 2001. [ bib \| .html ]
[85]	A. Ben-Hur, T. Barnes, P. L. Bartlett, O. Chapelle, A. Elisseeff, H. Fristche, I. Guyon, B. Schölkopf, J. Weston, E. Fung, C. Enderwick, E. A. Dalmasso, B.-L. Adam, J. W. Davis, A. Vlahou, L. Cazares, M. Ward, P. F. Schellhammer, J. Semmes, and G. L. Wright. Application of support vector machines to the classification of proteinchip system mass spectral data of prostate cancer serum samples (abstract). In Second Annual National Cancer Institute Early Detection Research Network Scientific Workshop, 2001. [ bib ]
[86]	P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. In Proceedings of the Fourteenth Annual Conference on Computational Learning Theory and Fifth European Conference on Computational Learning Theory, pages 224--240, 2001. [ bib ]
[87]	A. J. Smola and P. L. Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems 13, pages 619--625, 2001. [ bib ]
[88]	Y. Guo, P. L. Bartlett, J. Shawe-Taylor, and R. C. Williamson. Covering numbers for support vector machines. IEEE Transactions on Information Theory, 48(1):239--250, 2002. [ bib ]
[89]	P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463--482, 2002. [ bib \| .pdf ]
[90]	P. L. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. Machine Learning, 48:85--113, 2002. [ bib \| .ps.gz ]
[91]	P. L. Bartlett and J. Baxter. Estimation and approximation bounds for gradient-based reinforcement learning. Journal of Computer and System Sciences, 64(1):133--150, 2002. [ bib ]
[92]	P. L. Bartlett and S. Ben-David. Hardness results for neural network approximation problems. Theoretical Computer Science, 284(1):53--66, 2002. (special issue on Eurocolt'99). [ bib \| http ]
[93]	P. L. Bartlett, P. Fischer, and K.-U. Höffgen. Exploiting random walks for learning. Information and Computation, 176(2):121--135, 2002. [ bib \| http ]
[94]	L. Mason, P. L. Bartlett, and M. Golea. Generalization error of combined classifiers. Journal of Computer and System Sciences, 65(2):415--438, 2002. [ bib \| http ]
[95]	P. L. Bartlett, O. Bousquet, and S. Mendelson. Localized Rademacher complexity. In Proceedings of the Conference on Computational Learning Theory, pages 44--58, 2002. [ bib ]
[96]	G. Lanckriet, N. Cristianini, P. L. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semi-definite programming. In Proceedings of the International Conference on Machine Learning, pages 323--330, 2002. [ bib ]
[97]	E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. In Advances in Neural Information Processing Systems 14, pages 1507--1514, 2002. [ bib \| .ps.gz ]
[98]	Peter L. Bartlett. Prediction algorithms: complexity, concentration and convexity. In Proceedings of the 13th IFAC Symposium on System Identification, pages 1507--1517, 2003. [ bib \| .ps.Z ] In this paper, we review two families of algorithms used to estimate large-scale statistical models for prediction problems, kernel methods and boosting algorithms. We focus on the computational and statistical properties of prediction algorithms of this kind. Convexity plays an important role for these algorithms, since they exploit the computational advantages of convex optimization procedures. However, in addition to its computational advantages, the use of convexity in these methods also confers some attractive statistical properties. We present some recent results that show the advantages of convexity for estimation rates, the rates at which the prediction accuracies approach their optimal values. In addition, we present results that quantify the cost of using a convex loss function in place of the real loss function of interest.
[99]	Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Technical Report 638, Department of Statistics, U.C. Berkeley, 2003. [ bib \| .ps.Z \| .pdf ] Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial upper bounds on excess risk under the weakest possible condition on the loss function: that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a simple variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled convex hulls of a finite-dimensional base class, with a variety of commonly used loss functions.
[100]	Peter L. Bartlett and Wolfgang Maass. Vapnik-Chervonenkis dimension of neural nets. In Michael A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 1188--1192. MIT Press, 2003. Second Edition. [ bib \| .ps.gz \| .pdf ]
[101]	Peter L. Bartlett. An introduction to reinforcement learning theory: value function methods. In Shahar Mendelson and Alexander J. Smola, editors, Advanced Lectures on Machine Learning, volume 2600, pages 184--202. Springer, 2003. [ bib ]
[102]	G. Lanckriet, N. Cristianini, P. L. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semi-definite programming. Journal of Machine Learning Research, 5:27--72, 2004. [ bib \| .ps.gz \| .pdf ]
[103]	E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5:1471--1530, 2004. [ bib \| .pdf ]
[104]	Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Discussion of boosting papers. The Annals of Statistics, 32(1):85--91, 2004. [ bib \| .ps.Z \| .pdf ]
[105]	Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Large margin classifiers: convex loss, low noise, and convergence rates. In Advances in Neural Information Processing Systems, 16, 2004. [ bib \| .ps.gz ] Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function---that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.
[106]	Peter L. Bartlett and Ambuj Tewari. Sparseness vs estimating conditional probabilities: Some asymptotic results. In Proceedings of the 17th Annual Conference on Learning Theory, volume 3120, pages 564--578. Springer, 2004. [ bib \| .ps.gz \| .pdf ] One of the nice properties of kernel classifiers such as SVMs is that they often produce sparse solutions. However, the decision functions of these classifiers cannot always be used to estimate the conditional probability of the class label. We investigate the relationship between these two properties and show that these are intimately related: sparseness does not occur when the conditional probabilities can be unambiguously estimated. We consider a family of convex loss functions and derive sharp asymptotic bounds for the number of support vectors. This enables us to characterize the exact trade-off between sparseness and the ability to estimate conditional probabilities for these loss functions.
[107]	Peter L. Bartlett, Shahar Mendelson, and Petra Philips. Local complexities for empirical risk minimization. In Proceedings of the 17th Annual Conference on Computational Learning Theory (COLT2004), volume 3120, pages 270--284. Springer, 2004. [ bib \| .ps.gz \| .pdf ] We present sharp bounds on the risk of the empirical minimization algorithm under mild assumptions on the class. We introduce the notion of isomorphic coordinate projections and show that this leads to a sharper error bound than the best previously known. The quantity which governs this bound on the empirical minimizer is the fixed point of the function (r) = sup {\|f - _n f\|: finF, f = r }. We prove that this is the best estimate one can obtain using `structural results', and that it is possible to estimate the error rate from data. We then prove that the bound on the empirical minimization algorithm can be improved further by a direct analysis, and that the correct error rate is the maximizer of (r) -r, where (r) = sup {f - _n f: finF, f = r }.
[108]	Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. In Proceedings of the 18th Annual Conference on Learning Theory, volume 3559, pages 143--157. Springer, 2005. [ bib \| .pdf ]
[109]	Peter L. Bartlett, Michael Collins, Ben Taskar, and David McAllester. Exponentiated gradient algorithms for large-margin structured classification. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 113--120, Cambridge, MA, 2005. MIT Press. [ bib \| .ps.gz \| .pdf ] We consider the problem of structured classification, where the task is to predict a label y from an input x, and y has meaningful internal structure. Our framework includes supervised training of both Markov random fields and weighted context-free grammars as special cases. We describe an algorithm that solves the large-margin optimization problem defined by Taskar et al, using an exponential-family (Gibbs distribution) representation of structured objects. The algorithm is efficient -- even in cases where the number of labels y is exponential in size -- provided that certain expectations under Gibbs distributions can becalculated efficiently. The optimization method we use for structured labels relies on a more general result, specifically the application of exponentiated gradient (EG) updates to quadratic programs (QPs). We describe a new method for solving QPs based on these techniques, and give bounds on its rate of convergence. In addition to their application to the structured-labels task, the EG updates lead to simple algorithms for optimizing “conventional” binary or multiclass SVM problems. Finally, we give a new generalization bound for structured classification, using PAC-Bayesian methods for the analysis of large margin classifiers.
[110]	Rafael Jiménez-Rodriguez, Nicholas Sitar, and Peter L. Bartlett. Maximum likelihood estimation of trace length distribution parameters using the EM algorithm. In G. Barla and M. Barla, editors, Prediction, Analysis and Design in Geomechanical Applications: Proceedings of the Eleventh International Conference on Computer Methods and Advances in Geomechanics (IACMAG-2005), volume 1, pages 619--626, Bologna, 2005. Pàtron Editore. [ bib ]
[111]	Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497--1537, 2005. [ bib \| .ps \| .pdf ] We propose new bounds on the error of learning algorithms in terms of a data-dependent notion of complexity. The estimates we establish give optimal rates and are based on a local and empirical version of Rademacher averages, in the sense that the Rademacher averages are computed from the data, on a subset of functions with small empirical error. We present some applications to prediction with bounded loss, and to regression with a convex loss function and a convex function class.
[112]	Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138--156, 2006. (Was Department of Statistics, U.C. Berkeley Technical Report number 638, 2003). [ bib \| .ps.gz \| .pdf ] Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial upper bounds on excess risk under the weakest possible condition on the loss function: that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a simple variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled convex hulls of a finite-dimensional base class, with a variety of commonly used loss functions.
[113]	Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311--334, 2006. [ bib \| .ps.gz \| .pdf ] We investigate the behavior of the empirical minimization algorithm using various methods. We first analyze it by comparing the empirical, random, structure and the original one on the class, either in an additive sense, via the uniform law of large numbers, or in a multiplicative sense, using isomorphic coordinate projections. We then show that a direct analysis of the empirical minimization algorithm yields a significantly better bound, and that the estimates we obtain are essentially sharp. The method of proof we use is based on Talagrand's concentration inequality for empirical processes.
[114]	Peter L. Bartlett and Marten H. Wegkamp. Classification with a reject option using a hinge loss. Technical report, U.C. Berkeley, 2006. [ bib \| .ps.gz \| .pdf ] We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function f, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss---the f-risk---also minimizes the risk. We also study the rate at which the f-risk approaches its minimum value. We show that fast rates are possible when the conditional probability Pr(Y=1\|X) is unlikely to be close to certain critical values.
[115]	Peter L. Bartlett and Mikhail Traskin. Adaboost is consistent. Technical report, U. C. Berkeley, 2006. [ bib ]
[116]	Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Comment. Statistical Science, 21(3):341--346, 2006. [ bib ]
[117]	Peter L. Bartlett and Mikhail Traskin. Adaboost and other large margin classifiers: Convexity in pattern classification. In Proceedings of the 5th Workshop on Defence Applications of Signal Processing, 2006. [ bib ]
[118]	Peter L. Bartlett and Shahar Mendelson. Discussion of “2004 IMS Medallion Lecture: Local Rademacher complexities and oracle inequalities in risk minimization” by V. Koltchinskii. The Annals of Statistics, 34(6):2657--2663, 2006. [ bib ]
[119]	Peter L. Bartlett, Shahar Mendelson, and Petra Philips. Optimal sample-based estimates of the expectation of the empirical minimizer. Technical report, U.C. Berkeley, 2007. [ bib \| .ps.gz \| .pdf ] We study sample-based estimates of the expectation of the function produced by the empirical minimization algorithm. We investigate the extent to which one can estimate the rate of convergence of the empirical minimizer in a data dependent manner. We establish three main results. First, we provide an algorithm that upper bounds the expectation of the empirical minimizer in a completely data-dependent manner. This bound is based on a structural result in http://www.stat.berkeley.edu/~bartlett/papers/bm-em-03.pdf, which relates expectations to sample averages. Second, we show that these structural upper bounds can be loose. In particular, we demonstrate a class for which the expectation of the empirical minimizer decreases as O(1/n) for sample size n, although the upper bound based on structural properties is Ω(1). Third, we show that this looseness of the bound is inevitable: we present an example that shows that a sharp bound cannot be universally recovered from empirical data.
[120]	Peter L. Bartlett. Fast rates for estimation error and oracle inequalities for model selection. Technical Report 729, Department of Statistics, U.C. Berkeley, 2007. [ bib \| .pdf ] We consider complexity penalization methods for model selection. These methods aim to choose a model to optimally trade off estimation and approximation errors by minimizing the sum of an empirical risk term and a complexity penalty. It is well known that if we use a bound on the maximal deviation between empirical and true risks as a complexity penalty, then the risk of our choice is no more than the approximation error plus twice the complexity penalty. There are many cases, however, where complexity penalties like this give loose upper bounds on the estimation error. In particular, if we choose a function from a suitably simple convex function class with a strictly convex loss function, then the estimation error (the difference between the risk of the empirical risk minimizer and the minimal risk in the class) approaches zero at a faster rate than the maximal deviation between empirical and true risks. In this note, we address the question of whether it is possible to design a complexity penalized model selection method for these situations. We show that, provided the sequence of models is ordered by inclusion, in these cases we can use tight upper bounds on estimation error as a complexity penalty. Surprisingly, this is the case even in situations when the difference between the empirical risk and true risk (and indeed the error of any estimate of the approximation error) decreases much more slowly than the complexity penalty. We give an oracle inequality showing that the resulting model selection method chooses a function with risk no more than the approximation error plus a constant times the complexity penalty.
[121]	Peter L. Bartlett and Ambuj Tewari. Sample complexity of policy search with known dynamics. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 97--104, Cambridge, MA, 2007. MIT Press. [ bib \| .pdf ]
[122]	Benjamin I. P. Rubinstein, Peter L. Bartlett, and J. Hyam Rubinstein. Shifting, one-inclusion mistake bounds and tight multiclass expected risk bounds. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 1193--1200, Cambridge, MA, 2007. MIT Press. [ bib \| .pdf ]
[123]	David Rosenberg and Peter L. Bartlett. The Rademacher complexity of co-regularized kernel classes. In Marina Meila and Xiaotong Shen, editors, Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, volume 2, pages 396--403, 2007. [ bib \| .pdf ] In the multi-view approach to semi-supervised learning, we choose one predictor from each of multiple hypothesis classes, and we `co-regularize' our choices by penalizing disagreement among the predictors on the unlabeled data. In this paper we examine the co-regularization method used in the recently proposed co-regularized least squares (CoRLS) algorithm. In this method we have two hypothesis classes, each a reproducing kernel Hilbert space (RKHS), and we co-regularize by penalizing the average squared difference in predictions on the unlabeled data. We get our final predictor by taking the pointwise average of the predictors from each view. We call the set of predictors that can result from this procedure the co-regularized hypothesis class. The main result of this paper is a tight bound on the Rademacher complexity of the co-regularized hypothesis class in terms of the kernel matrices of each RKHS. We find that the co-regularization reduces the Rademacher complexity of the hypothesis class by an amount depending on how different the two views are, measured by a data dependent metric. We then use standard techniques to bound the gap between training error and test error for the CoRLS algorithm. Experimentally, we find that the amount of reduction in complexity introduced by co-regularization correlates with the amount of improvement that co-regularization gives in the CoRLS algorithm
[124]	Peter L. Bartlett and Mikhail Traskin. Adaboost is consistent. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 105--112, Cambridge, MA, 2007. MIT Press. [ bib \| .pdf ] The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n^(1-a) iterations---for sample size n and a>0---the sequence of risks of the classifiers it produces approaches the Bayes risk.
[125]	Ambuj Tewari and Peter L. Bartlett. Bounded parameter Markov decision processes with average reward criterion. In Proceedings of the Conference on Learning Theory, pages 263--277, 2007. [ bib ]
[126]	Jacob Abernethy, Peter L. Bartlett, and Alexander Rakhlin. Multitask learning with expert advice. In Proceedings of the Conference on Learning Theory, pages 484--498, 2007. [ bib ] We consider the problem of prediction with expert advice in the setting where a forecaster is presented with several online prediction tasks. Instead of competing against the best expert separately on each task, we assume the tasks are related, and thus we expect that a few experts will perform well on the entire set of tasks. That is, our forecaster would like, on each task, to compete against the best expert chosen from a small set of experts. While we describe the `ideal' algorithm and its performance bound, we show that the computation required for this algorithm is as hard as computation of a matrix permanent. We present an efficient algorithm based on mixing priors, and prove a bound that is nearly as good for the sequential task presentation case. We also consider a harder case where the task may change arbitrarily from round to round, and we develop an efficient randomized algorithm based on Markov chain Monte Carlo techniques.
[127]	Alexander Rakhlin, Jacob Abernethy, and Peter L. Bartlett. Online discovery of similarity mappings. In Proceedings of the 24th International Conference on Machine Learning (ICML-2007), pages 767--774, 2007. [ bib ] We consider the problem of choosing, sequentially, a map which assigns elements of a set A to a few elements of a set B. On each round, the algorithm suffers some cost associated with the chosen assignment, and the goal is to minimize the cumulative loss of these choices relative to the best map on the entire sequence. Even though the offline problem of finding the best map is provably hard, we show that there is an equivalent online approximation algorithm, Randomized Map Prediction (RMP), that is efficient and performs nearly as well. While drawing upon results from the `Online Prediction with Expert Advice' setting, we show how RMP can be utilized as an online approach to several standard batch problems. We apply RMP to online clustering as well as online feature selection and, surprisingly, RMP often outperforms the standard batch algorithms on these problems.
[128]	Peter L. Bartlett and Mikhail Traskin. Adaboost is consistent. Journal of Machine Learning Research, 8:2347--2368, 2007. [ bib \| .pdf \| .pdf ] The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n⁽1-a) iterations---for sample size n and 0<a<1---the sequence of risks of the classifiers it produces approaches the Bayes risk.
[129]	Benjamin I. P. Rubinstein, Peter L. Bartlett, and J. Hyam Rubinstein. Shifting: one-inclusion mistake bounds and sample compression. Technical report, EECS Department, University of California, Berkeley, 2007. [ bib \| .pdf ] We present new expected risk bounds for binary and multiclass prediction, and resolve several recent conjectures on sample compressibility due to Kuzmin and Warmuth. By exploiting the combinatorial structure of concept class F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy. The key step in their proof is a d=VC(F) bound on the graph density of a subgraph of the hypercube - one-inclusion graph. The first main result of this report is a density bound of n choose(n-1,<=d-1)/choose(n,<=d) < d, which positively resolves a conjecture of Kuzmin and Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved one-inclusion mistake bound. The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is an algebraic topological property of maximum classes of VC-dimension d as being d-contractible simplicial complexes, extending the well-known characterization that d=1 maximum classes are trees. We negatively resolve a minimum degree conjecture of Kuzmin and Warmuth - the second part to a conjectured proof of correctness for Peeling - that every class has one-inclusion minimum degree at most its VC-dimension. Our final main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This result improves on known PAC-based expected risk bounds by a factor of O(log n) and is shown to be optimal up to a O(log k) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout.
[130]	David Rosenberg and Peter L. Bartlett. On bounds for Bayesian sequence prediction with non-Gaussian priors. Technical report, 2007. Technical Report. [ bib ] We present worst-case log-loss regret bounds for the Bayesian model averaging algorithm in the regression setting. Our work generalizes earlier work that gives bounds of a similar form that only hold for Gaussian priors. Our bounds hold for arbitrary priors, though the regret term now includes a term involving the modulus of continuity of the prior, as well as the value of the prior at the point of comparison.
[131]	Jacob Abernethy, Peter L. Bartlett, Alexander Rakhlin, and Ambuj Tewari. Minimax lower bounds for online convex games. Technical report, UC Berkeley, 2007. [ bib \| .pdf ] A number of learning problems can be cast as an Online Convex Game: on each round, a learner makes a prediction x from a convex set, the environment plays a loss function f, and the learner's long-term goal is to minimize regret. Algorithms have been proposed by Zinkevich, when f is assumed to be convex, and Hazan et al, when f is assumed to be strongly convex, that have provably low regret. We consider these two settings and analyze such games from a minimax perspective, proving lower bounds in each case. These results prove that the existing algorithms are essentially optimal.
[132]	Jacob Duncan Abernethy, Peter L. Bartlett, and Alexander Rakhlin. Multitask learning with expert advice. Technical Report UCB/EECS-2007-20, EECS Department, University of California, Berkeley, 2007. [ bib \| .html ]
[133]	Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, and Peter L. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. Technical report, U.C. Berkeley, 2007. [ bib \| .pdf ] Log-linear and maximum-margin models are two commonly used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or max-margin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O(1/ε) EG updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log ( 1/ε)) updates are required. For both the max-margin and log-linear cases, our bounds suggest that the online algorithm requires a factor of n less computation to reach a desired accuracy, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to multi-class problems as well as a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods.
[134]	Peter L. Bartlett and Ambuj Tewari. Sparseness vs estimating conditional probabilities: Some asymptotic results. Journal of Machine Learning Research, 8:775--790, April 2007. [ bib \| .html ]
[135]	Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007--1025, May 2007. (Invited paper). [ bib \| .html ]
[136]	Alekh Agarwal, Alexander Rakhlin, and Peter Bartlett. Matrix regularization techniques for online multitask learning. Technical Report UCB/EECS-2008-138, EECS Department, University of California, Berkeley, 2008. [ bib \| .pdf ] In this paper we examine the problem of prediction with expert advice in a setup where the learner is presented with a sequence of examples coming from different tasks. In order for the learner to be able to benefit from performing multiple tasks simultaneously, we make assumptions of task relatedness by constraining the comparator to use a lesser number of best experts than the number of tasks. We show how this corresponds naturally to learning under spectral or structural matrix constraints, and propose regularization techniques to enforce the constraints. The regularization techniques proposed here are interesting in their own right and multitask learning is just one application for the ideas. A theoretical analysis of one such regularizer is performed, and a regret bound that shows benefits of this setup is reported.
[137]	Peter L. Bartlett. Fast rates for estimation error and oracle inequalities for model selection. Econometric Theory, 24(2):545--552, April 2008. (Was Department of Statistics, U.C. Berkeley Technical Report number 729, 2007). [ bib \| DOI \| .pdf ] We consider complexity penalization methods for model selection. These methods aim to choose a model to optimally trade off estimation and approximation errors by minimizing the sum of an empirical risk term and a complexity penalty. It is well known that if we use a bound on the maximal deviation between empirical and true risks as a complexity penalty, then the risk of our choice is no more than the approximation error plus twice the complexity penalty. There are many cases, however, where complexity penalties like this give loose upper bounds on the estimation error. In particular, if we choose a function from a suitably simple convex function class with a strictly convex loss function, then the estimation error (the difference between the risk of the empirical risk minimizer and the minimal risk in the class) approaches zero at a faster rate than the maximal deviation between empirical and true risks. In this note, we address the question of whether it is possible to design a complexity penalized model selection method for these situations. We show that, provided the sequence of models is ordered by inclusion, in these cases we can use tight upper bounds on estimation error as a complexity penalty. Surprisingly, this is the case even in situations when the difference between the empirical risk and true risk (and indeed the error of any estimate of the approximation error) decreases much more slowly than the complexity penalty. We give an oracle inequality showing that the resulting model selection method chooses a function with risk no more than the approximation error plus a constant times the complexity penalty.
[138]	Peter L. Bartlett and Marten H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9:1823--1840, August 2008. [ bib \| .pdf ] We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function f, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss---the f-risk---also minimizes the risk. We also study the rate at which the f-risk approaches its minimum value. We show that fast rates are possible when the conditional probability Pr(Y=1\|X) is unlikely to be close to certain critical values.
[139]	Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, and Peter L. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. Journal of Machine Learning Research, 9:1775--1822, August 2008. [ bib \| .pdf ] Log-linear and maximum-margin models are two commonly used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or max-margin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O(1/ε) EG updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log ( 1/ε)) updates are required. For both the max-margin and log-linear cases, our bounds suggest that the online algorithm requires a factor of n less computation to reach a desired accuracy, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to multi-class problems as well as a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods.
[140]	Peter L. Bartlett, Elad Hazan, and Alexander Rakhlin. Adaptive online gradient descent. In John Platt, Daphne Koller, Yoram Singer, and Sam Roweis, editors, Advances in Neural Information Processing Systems 20, pages 65--72, Cambridge, MA, September 2008. MIT Press. [ bib \| .pdf ] We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al for strongly convex functions, achieving intermediate rates between sqrt(T) and log T. Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms.
[141]	Ambuj Tewari and Peter L. Bartlett. Optimistic linear programming gives logarithmic regret for irreducible MDPs. In John Platt, Daphne Koller, Yoram Singer, and Sam Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1505--1512, Cambridge, MA, September 2008. MIT Press. [ bib \| .pdf ] We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates: a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P)log T of the reward obtained by the optimal policy, where C(P) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities and the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP.
[142]	Wee Sun Lee, Peter L. Bartlett, and Robert C. Williamson. Correction to the importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 54(9):4395, September 2008. [ bib \| .pdf ]
[143]	Massieh Najafi, David M. Auslander, Peter L. Bartlett, and Philip Haves. Fault diagnostics and supervised testing: How fault diagnostic tools can be proactive? In K. Grigoriadis, editor, Proceedings of Intelligent Systems and Control (ISC 2008), pages 633--034, September 2008. [ bib ]
[144]	Massieh Najafi, David M. Auslander, Peter L. Bartlett, and Philip Haves. Overcoming the complexity of diagnostic problems due to sensor network architecture. In K. Grigoriadis, editor, Proceedings of Intelligent Systems and Control (ISC 2008), pages 633--071, September 2008. [ bib ]
[145]	Massieh Najafi, David M. Auslander, Peter L. Bartlett, and Philip Haves. Application of machine learning in fault diagnostics of mechanical systems. In Proceedings of the World Congress on Engineering and Computer Science 2008: International Conference on Modeling, Simulation and Control 2008, pages 957--962, October 2008. [ bib \| .pdf ]
[146]	Marco Barreno, Peter L. Bartlett, F. J. Chi, Anthony D. Joseph, Blaine Nelson, Benjamin I. P. Rubinstein, U. Saini, and J. Doug Tygar. Open problems in the security of learning. In Proceedings of the 1st ACM Workshop on AISec (AISec2008), pages 19--26, October 2008. [ bib \| DOI ]
[147]	Peter L. Bartlett, Varsha Dani, Thomas Hayes, Sham Kakade, Alexander Rakhlin, and Ambuj Tewari. High-probability regret bounds for bandit online linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008), pages 335--342, December 2008. [ bib \| .pdf ]
[148]	Jacob Abernethy, Peter L. Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008), pages 415--423, December 2008. [ bib \| .pdf ]
[149]	Benjamin I. P. Rubinstein, Peter L. Bartlett, and J. Hyam Rubinstein. Shifting: one-inclusion mistake bounds and sample compression. Journal of Computer and System Sciences, 75(1):37--59, January 2009. (Was University of California, Berkeley, EECS Department Technical Report EECS-2007-86). [ bib \| .pdf ]
[150]	Jacob Abernethy, Alekh Agarwal, Peter L. Bartlett, and Alexander Rakhlin. A stochastic view of optimal regret through minimax duality. Technical Report 0903.5328, arxiv.org, 2009. [ bib \| http ] We study the regret of optimal strategies for online convex optimization games. Using von Neumann's minimax theorem, we show that the optimal regret in this adversarial setting is closely related to the behavior of the empirical minimization algorithm in a stochastic process setting: it is equal to the maximum, over joint distributions of the adversary's action sequence, of the difference between a sum of minimal expected losses and the minimal empirical loss. We show that the optimal regret has a natural geometric interpretation, since it can be viewed as the gap in Jensen's inequality for a concave functional--the minimizer over the player's actions of expected loss--defined on a set of probability distributions. We use this expression to obtain upper and lower bounds on the regret of an optimal strategy for a variety of online learning problems. Our method provides upper bounds without the need to construct a learning algorithm; the lower bounds provide explicit optimal strategies for the adversary.
[151]	A. Barth, Benjamin I. P. Rubinstein, M. Sundararajan, J. C. Mitchell, Dawn Song, and Peter L. Bartlett. A learning-based approach to reactive security. Technical Report 0912.1155, arxiv.org, 2009. [ bib \| http ] Despite the conventional wisdom that proactive security is superior to reactive security, we show that reactive security can be competitive with proactive security as long as the reactive defender learns from past attacks instead of myopically overreacting to the last attack. Our game-theoretic model follows common practice in the security literature by making worst-case assumptions about the attacker: we grant the attacker complete knowledge of the defender's strategy and do not require the attacker to act rationally. In this model, we bound the competitive ratio between a reactive defense algorithm (which is inspired by online learning theory) and the best fixed proactive defense. Additionally, we show that, unlike proactive defenses, this reactive strategy is robust to a lack of information about the attacker's incentives and knowledge.
[152]	Benjamin I. P. Rubinstein, Peter L. Bartlett, Ling Huang, and Nina Taft. Learning in a large function space: Privacy preserving mechanisms for SVM learning. Technical Report 0911.5708, arxiv.org, 2009. [ bib \| http ] Several recent studies in privacy-preserving learning have considered the trade-off between utility or risk and the level of differential privacy guaranteed by mechanisms for statistical query processing. In this paper we study this trade-off in private Support Vector Machine (SVM) learning. We present two efficient mechanisms, one for the case of finite-dimensional feature mappings and one for potentially infinite-dimensional feature mappings with translation-invariant kernels. For the case of translation-invariant kernels, the proposed mechanism minimizes regularized empirical risk in a random Reproducing Kernel Hilbert Space whose kernel uniformly approximates the desired kernel with high probability. This technique, borrowed from large-scale learning, allows the mechanism to respond with a finite encoding of the classifier, even when the function class is of infinite VC dimension. Differential privacy is established using a proof technique from algorithmic stability. Utility--the mechanism's response function is pointwise epsilon-close to non-private SVM with probability 1-delta--is proven by appealing to the smoothness of regularized empirical risk minimization with respect to small perturbations to the feature mapping. We conclude with a lower bound on the optimal differential privacy of the SVM. This negative result states that for any delta, no mechanism can be simultaneously (epsilon,delta)-useful and beta-differentially private for small epsilon and small beta.
[153]	Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI2009), pages 35--42, June 2009. [ bib \| .pdf ]
[154]	Jacob Abernethy, Alekh Agarwal, Peter L. Bartlett, and Alexander Rakhlin. A stochastic view of optimal regret through minimax duality. In Proceedings of the 22nd Annual Conference on Learning Theory -- COLT 2009, pages 257--266, June 2009. [ bib \| .pdf ] We study the regret of optimal strategies for online convex optimization games. Using von Neumann's minimax theorem, we show that the optimal regret in this adversarial setting is closely related to the behavior of the empirical minimization algorithm in a stochastic process setting: it is equal to the maximum, over joint distributions of the adversary's action sequence, of the difference between a sum of minimal expected losses and the minimal empirical loss. We show that the optimal regret has a natural geometric interpretation, since it can be viewed as the gap in Jensen's inequality for a concave functional---the minimizer over the player's actions of expected loss---defined on a set of probability distributions. We use this expression to obtain upper and lower bounds on the regret of an optimal strategy for a variety of online learning problems. Our method provides upper bounds without the need to construct a learning algorithm; the lower bounds provide explicit optimal strategies for the adversary.
[155]	Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, and Martin Wainwright. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1--9, June 2009. [ bib \| .pdf ] Despite the large amount of literature on upper bounds on complexity of convex analysis, surprisingly little is known about the fundamental hardness of these problems. The extensive use of convex optimization in machine learning and statistics makes such an understanding very critical to understand fundamental computational limits of learning and estimation. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for some function classes. We also discuss implications of these results to learning and estimation.
[156]	David S. Rosenberg, Vikas Sindhwani, Peter L. Bartlett, and Partha Niyogi. Multiview point cloud kernels for semisupervised learning. IEEE Signal Processing Magazine, 26(5):145--150, September 2009. [ bib \| DOI ]
[157]	Marius Kloft, Ulrich Rückert, and Peter L. Bartlett. A unifying view of multiple kernel learning. Technical Report 1005.0437, arxiv.org, 2010. [ bib \| http ] Recent research on multiple kernel learning has lead to a number of approaches for combining kernels in regularized risk minimization. The proposed approaches include different formulations of objectives and varying regularization strategies. In this paper we present a unifying general optimization criterion for multiple kernel learning and show how existing formulations are subsumed as special cases. We also derive the criterion's dual representation, which is suitable for general smooth optimization algorithms. Finally, we evaluate multiple kernel learning in this framework analytically using a Rademacher complexity bound on the generalization error and empirically in a set of experiments.
[158]	Jacob Abernethy, Peter L. Bartlett, and Elad Hazan. Blackwell approachability and no-regret learning are equivalent. Technical Report 1011.1936, arxiv.org, 2010. [ bib \| http ] We consider the celebrated Blackwell Approachability Theorem for two-player games with vector payoffs. We show that Blackwell's result is equivalent, via efficient reductions, to the existence of 'no-regret' algorithms for Online Linear Optimization. Indeed, we show that any algorithm for one such problem can be efficiently converted into an algorithm for the other. We provide a useful application of this reduction: the first efficient algorithm for calibrated forecasting.
[159]	A. Barth, Benjamin I. P. Rubinstein, M. Sundararajan, J. C. Mitchell, Dawn Song, and Peter L. Bartlett. A learning-based approach to reactive security. In Proceedings of Financial Cryptography and Data Security (FC10), pages 192--206, 2010. [ bib \| DOI ]
[160]	Peter L. Bartlett, Shahar Mendelson, and Petra Philips. On the optimality of sample-based estimates of the expectation of the empirical minimizer. ESAIM: Probability and Statistics, 14:315--337, January 2010. [ bib \| .pdf ] We study sample-based estimates of the expectation of the function produced by the empirical minimization algorithm. We investigate the extent to which one can estimate the rate of convergence of the empirical minimizer in a data dependent manner. We establish three main results. First, we provide an algorithm that upper bounds the expectation of the empirical minimizer in a completely data-dependent manner. This bound is based on a structural result in http://www.stat.berkeley.edu/~bartlett/papers/bm-em-03.pdf, which relates expectations to sample averages. Second, we show that these structural upper bounds can be loose. In particular, we demonstrate a class for which the expectation of the empirical minimizer decreases as O(1/n) for sample size n, although the upper bound based on structural properties is Ω(1). Third, we show that this looseness of the bound is inevitable: we present an example that shows that a sharp bound cannot be universally recovered from empirical data.
[161]	Alekh Agarwal, Peter L. Bartlett, and Max Dama. Optimal allocation strategies for the dark pool problem. In Y. W. Teh and M. Titterington, editors, Proceedings of The Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 9, pages 9--16, May 2010. [ bib \| .pdf ] We study the problem of allocating stocks to dark pools. We propose and analyze an optimal approach for allocations, if continuous-valued allocations are allowed. We also propose a modification for the case when only integer-valued allocations are possible. We extend the previous work on this problem to adversarial scenarios, while also improving on those results in the iid setup. The resulting algorithms are efficient, and perform well in simulations under stochastic and adversarial inputs.
[162]	Peter L. Bartlett. Learning to act in uncertain environments. Communications of the ACM, 53(5):98, May 2010. (Invited one-page comment). [ bib \| DOI ]
[163]	Benjamin I. P. Rubinstein, Peter L. Bartlett, and J. Hyam Rubinstein. Corrigendum to `shifting: One-inclusion mistake bounds and sample compression' [J. Comput. System Sci 75 (1) (2009) 37-59]. Journal of Computer and System Sciences, 76(3--4):278--280, May 2010. [ bib \| DOI ]
[164]	Brian Kulis and Peter L. Bartlett. Implicit online learning. In Johannes Fürnkranz and Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 575--582, June 2010. [ bib \| .pdf ]
[165]	Marius Kloft, Ulrich Rückert, and Peter L. Bartlett. A unifying view of multiple kernel learning. In José L. Balcázar, Francesco Bonchi, Aristides Gionis, and Michèle Sebag, editors, Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD, pages 66--81, September 2010. Part II, LNAI 6322. [ bib \| DOI ]
[166]	Jacob Abernethy, Peter L. Bartlett, Niv Buchbinder, and Isabelle Stanton. A regularization approach to metrical task systems. In Marcus Hutter, Frank Stephan, Vladimir Vovk, and Thomas Zeugmann, editors, Algorithmic Learning Theory, 21st International Conference, ALT 2010, pages 270--284, October 2010. [ bib \| DOI ]
[167]	Peter L. Bartlett. Optimal online prediction in adversarial environments. In Marcus Hutter, Frank Stephan, Vladimir Vovk, and Thomas Zeugmann, editors, Algorithmic Learning Theory, 21st International Conference, ALT 2010, page 34, October 2010. (Plenary talk abstract). [ bib \| DOI ]
[168]	Sylvain Arlot and Peter L. Bartlett. Margin-adaptive model selection in statistical learning. Bernoulli, 17(2):687--713, May 2011. [ bib \| .pdf ]
[169]	Afshin Rostamizadeh, Alekh Agarwal, and Peter L. Bartlett. Learning with missing features. In Avi Pfeffer and Fabio G. Cozman, editors, Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI2011), pages 635--642, July 2011. [ bib \| .pdf ]
[170]	Jacob Abernethy, Peter L. Bartlett, and Elad Hazan. Blackwell approachability and no-regret learning are equivalent. In Sham Kakade and Ulrike von Luxburg, editors, Proceedings of the Conference on Learning Theory (COLT2011), volume 19, pages 27--46, July 2011. [ bib \| .pdf ]
[171]	Alekh Agarwal, John Duchi, Peter L. Bartlett, and Clement Levrard. Oracle inequalities for computationally budgeted model selection. In Sham Kakade and Ulrike von Luxburg, editors, Proceedings of the Conference on Learning Theory (COLT2011), volume 19, pages 69--86, July 2011. [ bib \| .pdf ] We analyze general model selection procedures using penalized empirical loss minimization under computational constraints. While classical model selection approaches do not consider computational aspects of performing model selection, we argue that any practical model selection procedure must not only trade off estimation and approximation error, but also the effects of computational effort required to compute empirical minimizers for different function classes. We provide a framework for analyzing such problems, and we give algorithms for model selection under a computational budget.
[172]	John Shawe-Taylor, Richard Zemel, Peter L. Bartlett, Fernando Pereira, and Kilian Weinberger, editors. Advances in Neural Information Processing Systems 24. Proceedings of the 2011 Conference. NIPS Foundation, December 2011. [ bib \| .html ]
[173]	Alekh Agarwal, Peter L. Bartlett, and John Duchi. Oracle inequalities for computationally adaptive model selection. Technical Report 1208.0129, arxiv.org, 2012. [ bib \| http ] We analyze general model selection procedures using penalized empirical loss minimization under computational constraints. While classical model selection approaches do not consider computational aspects of performing model selection, we argue that any practical model selection procedure must not only trade off estimation and approximation error, but also the computational effort required to compute empirical minimizers for different function classes. We provide a framework for analyzing such problems, and we give algorithms for model selection under a computational budget. These algorithms satisfy oracle inequalities that show that the risk of the selected model is not much worse than if we had devoted all of our computational budget to the optimal function class.
[174]	John C. Duchi, Peter L. Bartlett, and Martin J. Wainwright. Randomized Smoothing for (Parallel) Stochastic Optimization. In 2012 IEEE 51st Annual Conference on Decision and Control (CDC), IEEE Conference on Decision and Control, pages 5442--5444, 345 E 47th St, New York, NY 10017 USA, 2012. IEEE. [ bib ] By combining randomized smoothing techniques with accelerated gradient methods, we obtain convergence rates for stochastic optimization procedures, both in expectation and with high probability, that have optimal dependence on the variance of the gradient estimates. To the best of our knowledge, these are the first variance-based rates for non-smooth optimization. A combination of our techniques with recent work on decentralized optimization yields order-optimal parallel stochastic optimization algorithms. We give applications of our results to several statistical machine learning problems, providing experimental results (in the full version of the paper) demonstrating the effectiveness of our algorithms.
[175]	Fares Hedayati and Peter L. Bartlett. Exchangeability characterizes optimality of sequential normalized maximum likelihood and Bayesian prediction with Jeffreys prior. In M. Girolami and N. Lawrence, editors, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 22, pages 504--510, April 2012. [ bib \| .pdf ] We study online prediction of individual sequences under logarithmic loss with parametric experts. The optimal strategy, normalized maximum likelihood (NML), is computationally demanding and requires the length of the game to be known. We consider two simpler strategies: sequential normalized maximum likelihood (SNML), which computes the NML forecasts at each round as if it were the last round, and Bayesian prediction. Under appropriate conditions, both are known to achieve near-optimal regret. In this paper, we investigate when these strategies are optimal. We show that SNML is optimal iff the joint distribution on sequences defined by SNML is exchangeable. In the case of exponential families, this is equivalent to the optimality of any Bayesian prediction strategy, and the optimal prior is Jeffreys prior.
[176]	Alekh Agarwal, Peter Bartlett, Pradeep Ravikumar, and Martin Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235--3249, May 2012. [ bib \| DOI \| .pdf ] Relative to the large literature on upper bounds on complexity of convex optimization, lesser attention has been paid to the fundamental hardness of these problems. Given the extensive use of convex optimization in machine learning and statistics, gaining an understanding of these complexity-theoretic issues is important. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for various function classes.
[177]	John Duchi, Peter L. Bartlett, and Martin J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674--701, June 2012. [ bib \| .pdf ] We analyze convergence rates of stochastic optimization algorithms for nonsmooth convex optimization problems. By combining randomized smoothing techniques with accelerated gradient methods, we obtain convergence rates of stochastic optimization procedures, both in expectation and with high probability, that have optimal dependence on the variance of the gradient estimates. To the best of our knowledge, these are the first variance-based rates for nonsmooth optimization. We give several applications of our results to statistical estimation problems and provide experimental results that demonstrate the effectiveness of the proposed algorithms. We also describe how a combination of our algorithm with recent work on decentralized optimization yields a distributed stochastic optimization algorithm that is order-optimal.
[178]	Fares Hedayati and Peter Bartlett. The optimality of Jeffreys prior for online density estimation and the asymptotic normality of maximum likelihood estimators. In Proceedings of the Conference on Learning Theory (COLT2012), volume 23, pages 7.1--7.13, June 2012. [ bib \| .pdf ] We study online learning under logarithmic loss with regular parametric models. We show that a Bayesian strategy predicts optimally only if it uses Jeffreys prior. This result was known for canonical exponential families; we extend it to parametric models for which the maximum likelihood estimator is asymptotically normal. The optimal prediction strategy, normalized maximum likelihood, depends on the number n of rounds of the game, in general. However, when a Bayesian strategy is optimal, normalized maximum likelihood becomes independent of n. Our proof uses this to exploit the asymptotics of normalized maximum likelihood. The asymptotic normality of the maximum likelihood estimator is responsible for the necessity of Jeffreys prior.
[179]	A. Barth, Benjamin I. P. Rubinstein, M. Sundararajan, J. C. Mitchell, Dawn Song, and Peter L. Bartlett. A learning-based approach to reactive security. IEEE Transactions on Dependable and Secure Computing, 9(4):482--493, July 2012. [ bib \| http \| .pdf ] Despite the conventional wisdom that proactive security is superior to reactive security, we show that reactive security can be competitive with proactive security as long as the reactive defender learns from past attacks instead of myopically overreacting to the last attack. Our game-theoretic model follows common practice in the security literature by making worst-case assumptions about the attacker: we grant the attacker complete knowledge of the defender’s strategy and do not require the attacker to act rationally. In this model, we bound the competitive ratio between a reactive defense algorithm (which is inspired by online learning theory) and the best fixed proactive defense. Additionally, we show that, unlike proactive defenses, this reactive strategy is robust to a lack of information about the attacker’s incentives and knowledge.
[180]	Benjamin I. P. Rubinstein, Peter L. Bartlett, Ling Huang, and Nina Taft. Learning in a large function space: Privacy preserving mechanisms for SVM learning. Journal of Privacy and Confidentiality, 4(1):65--100, August 2012. [ bib \| http ]
[181]	Massieh Najafi, David M. Auslander, Peter L. Bartlett, Philip Haves, and Michael D. Sohn. Application of machine learning in the fault diagnostics of air handling units. Applied Energy, 96:347--358, August 2012. [ bib \| DOI ]
[182]	Peter L. Bartlett, Shahar Mendelson, and Joseph Neeman. l_1-regularized linear regression: Persistence and oracle inequalities. Probability Theory and Related Fields, 154(1--2):193--224, October 2012. [ bib \| DOI \| .pdf ] We study the predictive performance of _1-regularized linear regression in a model-free setting, including the case where the number of covariates is substantially larger than the sample size. We introduce a new analysis method that avoids the boundedness problems that typically arise in model-free empirical minimization. Our technique provides an answer to a conjecture of Greenshtein and Ritov [?] regarding the “persistence” rate for linear regression and allows us to prove an oracle inequality for the error of the regularized minimizer. It also demonstrates that empirical risk minimization gives optimal rates (up to log factors) of convex aggregation of a set of estimators of a regression function.
[183]	Peter L. Bartlett, Fernando Pereira, Chris J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, editors. Advances in Neural Information Processing Systems 25. Proceedings of the 2012 Conference. NIPS Foundation, December 2012. [ bib \| .html ]
[184]	Peter L. Bartlett, Peter Grunwald, Peter Harremoes, Fares Hedayati, and Wojciech Kotlowski. Horizon-independent optimal prediction with log-loss in exponential families. In Proceedings of the Conference on Learning Theory (COLT2013), volume 30, pages 639--661, 2013. [ bib \| .pdf ] We study online learning under logarithmic loss with regular parametric models. Hedayati and Bartlett (2012) showed that a Bayesian prediction strategy with Jeffreys prior and sequential normalized maximum likelihood (SNML) coincide and are optimal if and only if the latter is exchangeable, which occurs if and only if the optimal strategy can be calculated without knowing the time horizon in advance. They put forward the question what families have exchangeable SNML strategies. We answer this question for one-dimensional exponential families: SNML is exchangeable only for three classes of natural exponential family distributions,namely the Gaussian, the gamma, and the Tweedie exponential family of order 3/2.
[185]	Yevgeny Seldin, Koby Crammer, and Peter L Bartlett. Open problem: Adversarial multiarmed bandits with limited advice. In Proceedings of the Conference on Learning Theory (COLT2013), volume 30, pages 1067--1072, 2013. [ bib \| .pdf ]
[186]	Jacob Abernethy, Peter L. Bartlett, Rafael Frongillo, and Andre Wibisono. How to hedge an option against an adversary: Black-Scholes pricing is minimax optimal. In Advances in Neural Information Processing Systems 26, pages 2346--2354, 2013. [ bib \| http \| .pdf ] We consider a popular problem in finance, option pricing, through the lens of an online learning game between Nature and an Investor. In the Black-Scholes option pricing model from 1973, the Investor can continuously hedge the risk of an option by trading the underlying asset, assuming that the asset's price fluctuates according to Geometric Brownian Motion (GBM). We consider a worst-case model, in which Nature chooses a sequence of price fluctuations under a cumulative quadratic volatility constraint, and the Investor can make a sequence of hedging decisions. Our main result is to show that the value of our proposed game, which is the “regret” of hedging strategy, converges to the Black-Scholes option price. We use significantly weaker assumptions than previous work---for instance, we allow large jumps in the asset price---and show that the Black-Scholes hedging strategy is near-optimal for the Investor even in this non-stochastic framework.
[187]	Yasin Abbasi-Yadkori, Peter L. Bartlett, Varun Kanade, Yevgeny Seldin, and Csaba Szepesvari. Online learning in Markov decision processes with adversarially chosen transition probability distributions. In Advances in Neural Information Processing Systems 26, pages 2508--2516, 2013. [ bib \| http \| .pdf ] We study the problem of online learning Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves O(sqrt(Tlog \|Π\|)+log \|Π\|) regret with respect to a comparison set of policies Π. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set Π has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem. Here, in each episode an adversary may choose a weighted directed acyclic graph with an identified start and finish node. The goal of the learning algorithm is to choose a path that minimizes the loss while traversing from the start to finish node. At the end of each episode the loss function (given by weights on the edges) is revealed to the learning algorithm. The goal is to minimize regret with respect to a fixed policy for selecting paths. This problem is a special case of the online MDP problem. For randomly chosen graphs and adversarial losses, this problem can be efficiently solved. We show that it also can be efficiently solved for adversarial graphs and randomly chosen losses. When both graphs and losses are adversarially chosen, we present an efficient algorithm whose regret scales linearly with the number of distinct graphs. Finally, we show that designing efficient algorithms for the adversarial online shortest path problem (and hence for the adversarial MDP problem) is as hard as learning parity with noise, a notoriously difficult problem that has been used to design efficient cryptographic schemes.
[188]	Ambuj Tewari and Peter L. Bartlett. Learning theory. In Paulo S.R. Diniz, Johan A.K. Suykens, Rama Chellappa, and Sergios Theodoridis, editors, Signal Processing Theory and Machine Learning, volume 1 of Academic Press Library in Signal Processing, pages 775--816. Elsevier, 2014. [ bib ]
[189]	Yevgeny Seldin, Peter L. Bartlett, Koby Crammer, and Yasin Abbasi-Yadkori. Prediction with limited advice and multiarmed bandits with paid observations. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 280--287, 2014. [ bib \| .html \| .pdf ] We study two basic questions in online learning. The first question is what happens between full-information and limited-feedback games and the second question is the cost of information acquisition in online learning. The questions are addressed by defining two variations of standard online learning games. In the first variation, prediction with limited advice, we consider a game of prediction with expert advice, where on each round of the game we query the advice of a subset of M out of N experts. We present an algorithm that achieves O(sqrt((N/M)TlnN)) regret on T rounds of this game. The second variation, the multiarmed bandit with paid observations, is a variant of the adversarial N-armed bandit game, where on round t of the game, we can observe the reward of any number of arms, but each observation has a cost c. We present an algorithm that achieves O((c N lnN)^1/3 T^2/3) regret on T rounds of this game. We present lower bounds that show that, apart from the logarithmic factors, these regret bounds cannot be improved.
[190]	Yasin Abbasi-Yadkori, Peter L. Bartlett, and Varun Kanade. Tracking adversarial targets. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 369--377, 2014. [ bib \| .html \| .pdf ] We study linear quadratic problems with adversarial tracking targets. We propose a Follow The Leader algorithm and show that, under a stability condition, its regret grows as the logarithm of the number of rounds of the game. We also study a problem with adversarially chosen transition dynamics, for which an exponentially-weighted average algorithm is proposed and analyzed.
[191]	J. Hyam Rubinstein, Benjamin Rubinstein, and Peter Bartlett. Bounding embeddings of VC classes into maximum classes. Technical Report 1401.7388, arXiv.org, 2014. [ bib \| http ] One of the earliest conjectures in computational learning theory---the Sample Compression Conjecture---asserts that concept classes (or set systems) admit compression schemes of size polynomial in their VC dimension. To-date this statement is known to be true for maximum classes---those that meet Sauer's Lemma, which bounds class cardinality in terms of VC dimension, with equality. The most promising approach to positively resolving the conjecture is by embedding general VC classes into maximum classes without super-linear increase to their VC dimensions, as such embeddings extend the known compression schemes to all VC classes. We show that maximum classes can be characterized by a local-connectivity property of the graph obtained by viewing the class as a cubical complex. This geometric characterization of maximum VC classes is applied to prove a negative embedding result which demonstrates VC-d classes that cannot be embedded in any maximum class of VC dimension lower than 2d. On the other hand, we give a general recursive procedure for embedding VC-d classes into VC-(d+k) maximum classes for smallest k.
[192]	J. Hyam Rubinstein, Benjamin Rubinstein, and Peter Bartlett. Bounding embeddings of VC classes into maximum classes. In A. Gammerman and V. Vovk, editors, Festschrift of Alexey Chervonenkis. Springer, 2014. [ bib \| http ] One of the earliest conjectures in computational learning theory---the Sample Compression Conjecture---asserts that concept classes (or set systems) admit compression schemes of size polynomial in their VC dimension. To-date this statement is known to be true for maximum classes---those that meet Sauer's Lemma, which bounds class cardinality in terms of VC dimension, with equality. The most promising approach to positively resolving the conjecture is by embedding general VC classes into maximum classes without super-linear increase to their VC dimensions, as such embeddings extend the known compression schemes to all VC classes. We show that maximum classes can be characterized by a local-connectivity property of the graph obtained by viewing the class as a cubical complex. This geometric characterization of maximum VC classes is applied to prove a negative embedding result which demonstrates VC-d classes that cannot be embedded in any maximum class of VC dimension lower than 2d. On the other hand, we give a general recursive procedure for embedding VC-d classes into VC-(d+k) maximum classes for smallest k.
[193]	Yasin Abbasi-Yadkori, Peter L. Bartlett, and Alan Malek. Linear programming for large-scale Markov decision problems. Technical Report 1402.6763, arXiv.org, 2014. [ bib \| http ] We consider the problem of controlling a Markov decision process (MDP) with a large state space, so as to minimize average cost. Since it is intractable to compete with the optimal policy for large scale problems, we pursue the more modest goal of competing with a low-dimensional family of policies. We use the dual linear programming formulation of the MDP average cost problem, in which the variable is a stationary distribution over state-action pairs, and we consider a neighborhood of a low-dimensional subset of the set of stationary distributions (defined in terms of state-action features) as the comparison class. We propose two techniques, one based on stochastic convex optimization, and one based on constraint sampling. In both cases, we give bounds that show that the performance of our algorithms approaches the best achievable by any policy in the comparison class. Most importantly, these results depend on the size of the comparison class, but not on the size of the state space. Preliminary experiments show the effectiveness of the proposed algorithms in a queuing application.
[194]	Yasin Abbasi-Yadkori, Peter L. Bartlett, and Alan Malek. Linear programming for large-scale Markov decision problems. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 496--504, 2014. [ bib \| .html \| .pdf ] We consider the problem of controlling a Markov decision process (MDP) with a large state space, so as to minimize average cost. Since it is intractable to compete with the optimal policy for large scale problems, we pursue the more modest goal of competing with a low-dimensional family of policies. We use the dual linear programming formulation of the MDP average cost problem, in which the variable is a stationary distribution over state-action pairs, and we consider a neighborhood of a low-dimensional subset of the set of stationary distributions (defined in terms of state-action features) as the comparison class. We propose two techniques, one based on stochastic convex optimization, and one based on constraint sampling. In both cases, we give bounds that show that the performance of our algorithms approaches the best achievable by any policy in the comparison class. Most importantly, these results depend on the size of the comparison class, but not on the size of the state space. Preliminary experiments show the effectiveness of the proposed algorithms in a queuing application.
[195]	Wouter M Koolen, Alan Malek, and Peter L Bartlett. Efficient minimax strategies for square loss games. In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3230--3238. Curran Associates, Inc., 2014. [ bib \| .pdf ] We consider online prediction problems where the loss between the prediction and the outcome is measured by the squared Euclidean distance and its generalization, the squared Mahalanobis distance. We derive the minimax solutions for the case where the prediction and action spaces are the simplex (this setup is sometimes called the Brier game) and the _2 ball (this setup is related to Gaussian density estimation). We show that in both cases the value of each sub-game is a quadratic function of a simple statistic of the state, with coefficients that can be efficiently computed using an explicit recurrence relation. The resulting deterministic minimax strategy and randomized maximin strategy are linear functions of the statistic.
[196]	Alex Kantchelian, Michael C Tschantz, Ling Huang, Peter L Bartlett, Anthony D Joseph, and J. Doug Tygar. Large-margin convex polytope machine. In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3248--3256. Curran Associates, Inc., 2014. [ bib \| .pdf ] We present the Convex Polytope Machine (CPM), a novel non-linear learning algorithm for large-scale binary classification tasks. The CPM finds a large margin convex polytope separator which encloses one class. We develop a stochastic gradient descent based algorithm that is amenable to massive datasets, and augment it with a heuristic procedure to avoid sub-optimal local minima. Our experimental evaluations of the CPM on large-scale datasets from distinct domains (MNIST handwritten digit recognition, text topic, and web security) demonstrate that the CPM trains models faster, sometimes several orders of magnitude, than state-of-the-art similar approaches and kernel-SVM methods while achieving comparable or better classification performance. Our empirical results suggest that, unlike prior similar approaches, we do not need to control the number of sub-classifiers (sides of the polytope) to avoid overfitting.
[197]	Wouter Koolen, Alan Malek, Peter L. Bartlett, and Yasin Abbasi-Yadkori. Minimax time series prediction. In C. Cortes, N.D. Lawrence, D.D. Lee, M. Sugiyama, R. Garnett, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2548--2556. Curran Associates, Inc., 2015. [ bib \| .pdf ] We consider an adversarial formulation of the problem of predicting a time series with square loss. The aim is to predict an arbitrary sequence of vectors almost as well as the best smooth comparator sequence in retrospect. Our approach allows natural measures of smoothness, such as the squared norm of increments. More generally, we can consider a linear time series model and penalize the comparator sequence through the energy of the implied driving noise terms. We derive the minimax strategy for all problems of this type, and we show that it can be implemented efficiently. The optimal predictions are linear in the previous observations. We obtain an explicit expression for the regret in terms of the parameters defining the problem. For typical, simple definitions of smoothness, the computation of the optimal predictions involves only sparse matrices. In the case of norm-constrained data, where the smoothness is defined in terms of the squared norm of the comparator's increments, we show that the regret grows as T/sqrt(\lambda, where T is the length of the game and λ specifies the smoothness of the comparator.)
[198]	Yasin Abbasi-Yadkori, Wouter Koolen, Alan Malek, and Peter L. Bartlett. Minimax time series prediction. Technical report, EECS Department, University of California, Berkeley, 2015. [ bib ]
[199]	Walid Krichene, Alexandre Bayen, and Peter L. Bartlett. Accelerating mirror descent in continuous and discrete time. Technical report, EECS Department, University of California, Berkeley, 2015. [ bib ]
[200]	Walid Krichene, Alexandre Bayen, and Peter L. Bartlett. Accelerating mirror descent in continuous and discrete time. In C. Cortes, N.D. Lawrence, D.D. Lee, M. Sugiyama, R. Garnett, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2827--2835. Curran Associates, Inc., 2015. [ bib \| .pdf ] We study accelerated mirror descent dynamics in continuous and discrete time. Combining the original continuous-time motivation of mirror descent with a recent ODE interpretation of Nesterov's accelerated method, we propose a family of continuous-time descent dynamics for convex functions with Lipschitz gradients, such that the solution trajectories are guaranteed to converge to the optimum at a (1/t^2) rate. We then show that a large family of first-order accelerated methods can be obtained as a discretization of the ODE, and these methods converge at a (1/k^2) rate. This connection between accelerated mirror descent and the ODE provides an intuitive approach to the design and analysis of accelerated first-order algorithms.
[201]	Yasin Abbasi-Yadkori, Peter L. Bartlett, and Stephen Wright. A Lagrangian relaxation approach to Markov decision problems. Technical report, UC Berkeley EECS, 2015. [ bib ] We study Markov decision problems (MDPs) over a restricted policy class, and show that a Lagrangian relaxation approach finds near-optimal policies in this class efficiently. In particular, the computational complexity depends on the number of features used to define policies, and not on the size of the state space. The statistical complexity also scales well: our method requires only low-dimensional second order statistics. Most value-function-based methods for MDPs return a policy that is greedy with respect to the value function estimate. We discuss drawbacks of this approach, and propose a new policy class defined for some parameter vector w by π_w(a\|x) = ( 1-Q_w(x,a) + _ν(.\|x) Q_w ) ν(a\|x), where Q_w is the state-action value function, ν is a baseline policy, and the mean of Q_w under ν(.\|x) acts as a normalizer. Similar to the greedy and Gibbs policies, the proposed policy assigns larger probabilities to actions with smaller value-function estimates. We demonstrate the effectiveness of our Lagrangian relaxation approach, applied to this policy class, on a queueing problem and an energy storage application.
[202]	Peter L. Bartlett. Online prediction. Technical report, UC Berkeley EECS, 2015. [ bib \| .pdf ] We review game-theoretic models of prediction, in which the process generating the data is modelled as an adversary with whom the prediction method competes. We present a formulation that encompasses a wide variety of decision problems, and focus on the relationship between prediction in this game-theoretic setting and prediction in the more standard probabilistic setting. In particular, we present a view of standard prediction strategies as Bayesian decision methods, and we show how the regret of optimal strategies depends on complexity measures that are closely related to those that appear in probabilistic settings.
[203]	Yasin Abbasi-Yadkori, Peter L Bartlett, Xi Chen, and Alan Malek. Large-scale Markov decision problems with KL control cost. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), volume 37, pages 1053--1062, June 2015. [ bib \| .html \| .pdf ]
[204]	Peter L. Bartlett, Wouter Koolen, Alan Malek, Eiji Takimoto, and Manfred Warmuth. Minimax fixed-design linear regression. In Proceedings of the Conference on Learning Theory (COLT2015), volume 40, pages 226--239, June 2015. [ bib \| .pdf \| .pdf ] We consider a linear regression game in which the covariates are known in advance: at each round, the learner predicts a real-value, the adversary reveals a label, and the learner incurs a squared error loss. The aim is to minimize the regret with respect to linear predictions. For a variety of constraints on the adversary's labels, we show that the minimax optimal strategy is linear, with a parameter choice that is reminiscent of ordinary least squares (and as easy to compute). The predictions depend on all covariates, past and future, with a particular weighting assigned to future covariates corresponding to the role that they play in the minimax regret. We study two families of label sequences: box constraints (under a covariate compatibility condition), and a weighted 2-norm constraint that emerges naturally from the analysis. The strategy is adaptive in the sense that it requires no knowledge of the constraint set. We obtain an explicit expression for the minimax regret for these games. For the case of uniform box constraints, we show that, with worst case covariate sequences, the regret is O(dlog T), with no dependence on the scaling of the covariates.
[205]	Yasin Abbasi-Yadkori, Peter L. Bartlett, and Stephen Wright. A fast and reliable policy improvement algorithm. In Proceedings of AISTATS 2016, pages 1338--1346, 2016. [ bib \| .html \| .pdf ] We introduce a simple, efficient method that improves stochastic policies for Markov decision processes. The computational complexity is the same as that of the value estimation problem. We prove that when the value estimation error is small, this method gives an improvement in performance that increases with certain variance properties of the initial policy and transition dynamics. Performance in numerical experiments compares favorably with previous policy improvement algorithms.
[206]	Victor Gabillon, Alessandro Lazaric, Mohammad Ghavamzadeh, Ronald Ortner, and Peter L. Bartlett. Improved learning complexity in combinatorial pure exploration bandits. In Proceedings of AISTATS 2016, pages 1004--1012, 2016. [ bib \| .html \| .pdf ] We study the problem of combinatorial pure exploration in the stochastic multi-armed bandit problem. We first construct a new measure of complexity that provably characterizes the learning performance of the algorithms we propose for the fixed confidence and the fixed budget setting. We show that this complexity is never higher than the one in existing work and illustrate a number of configurations in which it can be significantly smaller. While in general this improvement comes at the cost of increased computational complexity, we provide a series of examples, including a planning problem, where this extra cost is not significant.
[207]	Walid Krichene, Alexandre Bayen, and Peter L. Bartlett. Adaptive averaging in accelerated descent dynamics. In Advances in Neural Information Processing Systems 29, pages 2991--2999, 2016. [ bib \| http \| .pdf ] We study accelerated descent dynamics for constrained convex optimization. This dynamics can be described naturally as a coupling of a dual variable accumulating gradients at a given rate η(t), and a primal variable obtained as the weighted average of the mirrored dual trajectory, with weights w(t). Using a Lyapunov argument, we give sufficient conditions on η and w to achieve a desired convergence rate. As an example, we show that the replicator dynamics (an example of mirror descent on the simplex) can be accelerated using a simple averaging scheme. We then propose an adaptive averaging heuristic which adaptively computes the weights to speed up the decrease of the Lyapunov function. We provide guarantees on adaptive averaging in continuous-time, prove that it preserves the quadratic convergence rate of accelerated first-order methods in discrete-time, and give numerical experiments to compare it with existing heuristics, such as adaptive restarting. The experiments indicate that adaptive averaging performs at least as well as adaptive restarting, with significant improvements in some cases.
[208]	Yasin Abbasi-Yadkori, Alan Malek, Peter L. Bartlett, and Victor Gabillon. Hit-and-run for sampling and planning in non-convex spaces. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 888--895, Fort Lauderdale, FL, USA, 2017. [ bib \| .pdf ] We propose the Hit-and-Run algorithm for planning and sampling problems in non-convex spaces. For sampling, we show the first analysis of the Hit-and-Run algorithm in non-convex spaces and show that it mixes fast as long as certain smoothness conditions are satisfied. In particular, our analysis reveals an intriguing connection between fast mixing and the existence of smooth measure-preserving mappings from a convex space to the non-convex space. For planning, we show advantages of Hit-and-Run compared to state-of-the-art planning methods such as Rapidly-Exploring Random Trees.
[209]	Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning (ICML-17), volume 70 of Proceedings of Machine Learning Research, pages 4140--4149. PMLR, 2017. [ bib \| .html \| .pdf ] In this paper, we consider regression problems with one-hidden-layer neural networks (1NNs). We distill some properties of activation functions that lead to local strong convexity in the neighborhood of the ground-truth parameters for the 1NN squared-loss objective and most popular nonlinear activation functions satisfy the distilled properties, including rectified linear units (ReLUs), leaky ReLUs, squared ReLUs and sigmoids. For activation functions that are also smooth, we show local linear convergence guarantees of gradient descent under a resampling rule. For homogeneous activations, we show tensor methods are able to initialize the parameters to fall into the local strong convexity region. As a result, tensor initialization followed by gradient descent is guaranteed to recover the ground truth with sample complexity d ·log (1/ε) ·poly(k,λ) and computational complexity n·d ·poly(k,λ) for smooth homogeneous activations with high probability, where d is the dimension of the input, k (k<=d) is the number of hidden nodes, λ is a conditioning property of the ground-truth parameter matrix between the input layer and the hidden layer, ε is the targeted precision and n is the number of samples. To the best of our knowledge, this is the first work that provides recovery guarantees for 1NNs with both sample complexity and computational complexity linear in the input dimension and logarithmic in the precision.
[210]	Martin Péron, Kai Helge Becker, Peter L. Bartlett, and Iadine Chadès. Fast-tracking stationary MOMDPs for adaptive management problems. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), pages 4531--4537, 2017. [ bib \| http \| http ] Adaptive management is applied in conservation and natural resource management, and consists of making sequential decisions when the transition matrix is uncertain. Informally described as ’learning by doing’, this approach aims to trade off between decisions that help achieve the objective and decisions that will yield a better knowledge of the true transition matrix. When the true transition matrix is assumed to be an element of a finite set of possible matrices, solving a mixed observability Markov decision process (MOMDP) leads to an optimal trade-off but is very computationally demanding. Under the assumption (common in adaptive management) that the true transition matrix is stationary, we propose a polynomial-time algorithm to find a lower bound of the value function. In the corners of the domain of the value function (belief space), this lower bound is provably equal to the optimal value function. We also show that under further assumptions, it is a linear approximation of the optimal value function in a neighborhood around the corners. We evaluate the benefits of our approach by using it to initialize the solvers MO-SARSOP and Perseus on a novel computational sustainability problem and a recent adaptive management data challenge. Our approach leads to an improved initial value function and translates into significant computational gains for both solvers.
[211]	Yasin Abbasi-Yadkori, Peter L. Bartlett, and Victor Gabillon. Near minimax optimal players for the finite-time 3-expert prediction problem. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3033--3042. Curran Associates, Inc., 2017. [ bib \| .pdf \| .pdf ] We study minimax strategies for the online prediction problem with expert advice. It is conjectured that a simple adversary strategy, called Comb, is optimal in this game for any number of experts including the non asymptotic case where the number of experts is small. We make progress in this direction by showing that Comb is minimax optimal within an additive logarithmic error in the finite time three expert problems.
[212]	Peter Bartlett, Dylan Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6240--6249. Curran Associates, Inc., 2017. [ bib \| .pdf \| .pdf ] This paper presents a margin-based multiclass generalization bound for neural networks that scales with their margin-normalized spectral complexity: their Lipschitz constant, meaning the product of the spectral norms of the weight matrices, times a certain correction factor. This bound is empirically investigated for a standard AlexNet network trained with SGD on the mnist and cifar10 datasets, with both original and random labels; the bound, the Lipschitz constants, and the excess risks are all in direct correlation, suggesting both that SGD selects predictors whose complexity scales with the difficulty of the learning task, and secondly that the presented bound is sensitive to this complexity.
[213]	Peter L. Bartlett, Nick Harvey, Chris Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Technical Report 1703.02930, arXiv.org, 2017. [ bib \| http \| .pdf ] We prove new upper and lower bounds on the VC-dimension of deep neural networks with the ReLU activation function. These bounds are tight for almost the entire range of parameters. Letting W be the number of weights and L be the number of layers, we prove that the VC-dimension is O(W Llog(W)), and provide examples with VC-dimension Ω(W Llog (W/L)). This improves both the previously known upper bounds and lower bounds. In terms of the number U of non-linear units, we prove a tight bound Θ(W U) on the VC-dimension. All of these bounds generalize to arbitrary piecewise linear activation functions, and also hold for the pseudodimensions of these function classes. Combined with previous results, this gives an intriguing range of dependencies of the VC-dimension on depth for networks with different non-linearities: there is no dependence for piecewise-constant, linear dependence for piecewise-linear, and no more than quadratic dependence for general piecewise-polynomial.
[214]	Walid Krichene and Peter Bartlett. Acceleration and averaging in stochastic descent dynamics. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6796--6806. Curran Associates, Inc., 2017. [ bib \| .pdf \| .pdf ] We formulate and study a general family of (continuous-time) stochastic dynamics for accelerated first-order minimization of smooth convex functions. Building on an averaging formulation of accelerated mirror descent, we propose a stochastic variant in which the gradient is contaminated by noise, and study the resulting stochastic differential equation. We prove a bound on the rate of change of an energy function associated with the problem, then use it to derive estimates of convergence rates of the function values (almost surely and in expectation), both for persistent and asymptotically vanishing noise. We discuss the interaction between the parameters of the dynamics (learning rate and averaging rates) and the covariation of the noise process. In particular, we show how the asymptotic rate of covariation affects the choice of parameters and, ultimately, the convergence rate.
[215]	Niladri Chatterji and Peter Bartlett. Alternating minimization for dictionary learning with random initialization. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1997--2006. Curran Associates, Inc., 2017. [ bib \| .pdf \| .pdf ]
[216]	Fares Hedayati and Peter L. Bartlett. Exchangeability characterizes optimality of sequential normalized maximum likelihood and Bayesian prediction. IEEE Transactions on Information Theory, 63(10):6767--6773, October 2017. [ bib \| DOI \| .pdf \| .pdf ] We study online learning under logarithmic loss with regular parametric models. In this setting, each strategy corresponds to a joint distribution on sequences. The minimax optimal strategy is the normalized maximum likelihood (NML) strategy. We show that the sequential normalized maximum likelihood (SNML) strategy predicts minimax optimally (i.e. as NML) if and only if the joint distribution on sequences defined by SNML is exchangeable. This property also characterizes the optimality of a Bayesian prediction strategy. In that case, the optimal prior distribution is Jeffreys prior for a broad class of parametric models for which the maximum likelihood estimator is asymptotically normal. The optimal prediction strategy, normalized maximum likelihood, depends on the number n of rounds of the game, in general. However, when a Bayesian strategy is optimal, normalized maximum likelihood becomes independent of n. Our proof uses this to exploit the asymptotics of normalized maximum likelihood. The asymptotic normality of the maximum likelihood estimator is responsible for the necessity of Jeffreys prior.
[217]	Peter L. Bartlett, Steven Evans, and Philip M. Long. Representing smooth functions as compositions of near-identity functions with implications for deep network optimization. Technical Report 1804.05012, arXiv.org, 2018. [ bib \| http \| .pdf ] We show that any smooth bi-Lipschitz h can be represented exactly as a composition h_mo...h_1 of functions h_1,...,h_m that are close to the identity in the sense that each (h_i − Id) is Lipschitz, and the Lipschitz constant decreases inversely with the number m of functions composed. This implies that h can be represented to any accuracy by a deep residual network whose nonlinear layers compute functions with a small Lipschitz constant. Next, we consider nonlinear regression with a composition of near-identity nonlinear maps. We show that, regarding Fréchet derivatives with respect to the h_1,...,h_m, any critical point of a quadratic criterion in this near-identity region must be a global minimizer. In contrast, if we consider derivatives with respect to parameters of a fixed-size residual network with sigmoid activation functions, we show that there are near-identity critical points that are suboptimal, even in the realizable case. Informally, this means that functional gradient methods for residual networks cannot get stuck at suboptimal critical points corresponding to near-identity layers, whereas parametric gradient methods for sigmoidal residual networks suffer from suboptimal critical points in the near-identity region.
[218]	Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, and Peter Bartlett. Gradient diversity: a key ingredient for scalable distributed learning. In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 1998--2007. PMLR, 2018. [ bib \| .html \| .pdf ] It has been experimentally observed that distributed implementations of mini-batch stochastic gradient descent (SGD) algorithms exhibit speedup saturation and decaying generalization ability beyond a particular batch-size. In this work, we present an analysis hinting that high similarity between concurrently processed gradients may be a cause of this performance degradation. We introduce the notion of gradient diversity that measures the dissimilarity between concurrent gradient updates, and show its key role in the convergence and generalization performance of mini-batch SGD. We also establish that heuristics similar to DropConnect, Langevin dynamics, and quantization, are provably diversity-inducing mechanisms, and provide experimental evidence indicating that these mechanisms can indeed enable the use of larger batches without sacrificing accuracy and lead to faster training in distributed learning. For example, in one of our experiments, for a convolutional neural network to reach 95% training accuracy on MNIST, using the diversity-inducing mechanism can reduce the training time by 30% in the distributed setting.
[219]	Xiang Cheng, Fred Roosta, Stefan Palombo, Peter Bartlett, and Michael Mahoney. Flag n’ flare: Fast linearly-coupled adaptive gradient methods. In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 404--414. PMLR, 2018. [ bib \| .html \| .pdf ] We consider first order gradient methods for effectively optimizing a composite objective in the form of a sum of smooth and, potentially, non-smooth functions. We present accelerated and adaptive gradient methods, called FLAG and FLARE, which can offer the best of both worlds. They can achieve the optimal convergence rate by attaining the optimal first-order oracle complexity for smooth convex optimization. Additionally, they can adaptively and non-uniformly re-scale the gradient direction to adapt to the limited curvature available and conform to the geometry of the domain. We show theoretically and empirically that, through the compounding effects of acceleration and adaptivity, FLAG and FLARE can be highly effective for many data fitting and machine learning applications.
[220]	Xiang Cheng and Peter Bartlett. Convergence of Langevin MCMC in KL-divergence. In Firdaus Janoos, Mehryar Mohri, and Karthik Sridharan, editors, Proceedings of ALT2018, volume 83 of Proceedings of Machine Learning Research, pages 186--211. PMLR, 2018. [ bib \| .html \| .pdf ] Langevin diffusion is a commonly used tool for sampling from a given distribution. In this work, we establish that when the target density p^* is such that log p^* is L smooth and m strongly convex, discrete Langevin diffusion produces a distribution p with pp^≤ε in Õ((d)/(ε)) steps, where* d is the dimension of the sample space. We also study the convergence rate when the strong-convexity assumption is absent. By considering the Langevin diffusion as a gradient flow in the space of probability distributions, we obtain an elegant analysis that applies to the stronger property of convergence in KL-divergence and gives a conceptually simpler proof of the best-known convergence results in weaker metrics.
[221]	Martin Péron, Peter Bartlett, Kai Helge Becker, Kate Helmstedt, and Iadine Chadès. Two approximate dynamic programming algorithms for managing complete SIS networks. In ACM SIGCAS Conference on Computing and Sustainable Societies (COMPASS 2018), 2018. [ bib \| http \| .pdf ]
[222]	Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning (ICML-18), volume 80 of Proceedings of Machine Learning Research, pages 5650--5659. PMLR, 2018. [ bib \| .html \| .pdf ] we develop distributed optimization algorithms that are provably robust against Byzantine failures---arbitrary and potentially adversarial behavior, in distributed computing systems, with a focus on achieving optimal statistical performance. A main result of this work is a sharp analysis of two robust distributed gradient descent algorithms based on median and trimmed mean operations, respectively. We prove statistical error rates for all of strongly convex, non-strongly convex, and smooth non-convex population loss functions. In particular, these algorithms are shown to achieve order-optimal statistical error rates for strongly convex losses. To achieve better communication efficiency, we further propose a median-based distributed algorithm that is provably robust, and uses only one communication round. For strongly convex quadratic loss, we show that this algorithm achieves the same optimal error rate as the robust distributed gradient descent algorithms.
[223]	Niladri Chatterji, Nicolas Flammarion, Yian Ma, Peter Bartlett, and Michael Jordan. On the theory of variance reduction for stochastic gradient Monte Carlo. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning (ICML-18), volume 80 of Proceedings of Machine Learning Research, pages 764--773. PMLR, 2018. [ bib \| .html \| .pdf ] We provide convergence guarantees in Wasserstein distance for a variety of variance-reduction methods: SAGA Langevin diffusion, SVRG Langevin diffusion and control-variate underdamped Langevin diffusion. We analyze these methods under a uniform set of assumptions on the log-posterior distribution, assuming it to be smooth, strongly convex and Hessian Lipschitz. This is achieved by a new proof technique combining ideas from finite-sum optimization and the analysis of sampling methods. Our sharp theoretical bounds allow us to identify regimes of interest where each method performs better than the others. Our theory is verified with experiments on real-world and synthetic datasets.
[224]	Peter L. Bartlett, David P. Helmbold, and Philip M. Long. Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning (ICML-18), volume 80 of Proceedings of Machine Learning Research, pages 521--530. PMLR, 2018. [ bib \| http \| .pdf ] We analyze algorithms for approximating a function f(x) = Φx mapping ^d to ^d using deep linear neural networks, i.e. that learn a function h parameterized by matrices Θ_1, ..., Θ_L and defined by h(x) = Θ_LΘ_L−1...Θ_1x. We focus on algorithms that learn through gradient descent on the population quadratic loss in the case that the distribution over the inputs is isotropic. We provide polynomial bounds on the number of iterations for gradient descent to approximate the least squares matrix Φ, in the case where the initial hypothesis Θ_1 = ...= Θ_L = I has excess loss bounded by a small enough constant. On the other hand, we show that gradient descent fails to converge for Φ whose distance from the identity is a larger constant, and we show that some forms of regularization toward the identity in each layer do not help. If Φ is symmetric positive definite, we show that an algorithm that initializes Θ_i = I learns an ε-approximation of f using a number of updates polynomial in L, the condition number of Φ, and log (d/ε). In contrast, we show that if the least squares matrix Φ is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge. We analyze an algorithm for the case that Φ satisfies u^⊤Φu>0 for all u, but may not be symmetric. This algorithm uses two regularizers: one that maintains the invariant u^⊤Θ_LΘ_L-1...Θ_1 u>0 for all u, and another that `balances' Θ_1, ..., Θ_L so that they have the same singular values.
[225]	Alan Malek and Peter L. Bartlett. Horizon-independent minimax linear regression. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5264--5273. Curran Associates, Inc., 2018. [ bib \| .pdf ] We consider a linear regression game: at each round, an adversary reveals a covariate vector, the learner predicts a real value, the adversary reveals a label, and the learner suffers the squared prediction error. The aim is to minimize the difference between the cumulative loss and that of the linear predictor that is best in hindsight. Previous work demonstrated that the minimax optimal strategy is easy to compute recursively from the end of the game; this requires the entire sequence of covariate vectors in advance. We show that, once provided with a measure of the scale of the problem, we can invert the recursion and play the minimax strategy without knowing the future covariates. Further, we show that this forward recursion remains optimal even against adaptively chosen labels and covariates, provided that the adversary adheres to a set of constraints that prevent misrepresentation of the scale of the problem. This strategy is horizon-independent, i.e. it incurs no more regret than the optimal strategy that knows in advance the number of rounds of the game. We also provide an interpretation of the minimax algorithm as a follow-the-regularized-leader strategy with a data-dependent regularizer, and obtain an explicit expression for the minimax regret.
[226]	Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, and Michael I. Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference on Learning Theory (COLT2018), volume 75 of Proceedings of Machine Learning Research, pages 300--323. PMLR, 2018. [ bib \| .html \| .pdf ] We study the underdamped Langevin diffusion when the log of the target distribution is smooth and strongly concave. We present a MCMC algorithm based on its discretization and show that it achieves ε error (in 2-Wasserstein distance) in O(sqrt(d/ε) steps. This is a significant improvement over the best known rate for overdamped Langevin MCMC, which is O(d/ε^2) steps under the same smoothness/concavity assumptions. The underdamped Langevin MCMC scheme can be viewed as a version of Hamiltonian Monte Carlo (HMC) which has been observed to outperform overdamped Langevin MCMC methods in a number of application areas. We provide quantitative rates that support this empirical wisdom.)
[227]	Yasin Abbasi-Yadkori, Peter L. Bartlett, Victor Gabillon, Alan Malek, and Michal Valko. Best of both worlds: Stochastic and adversarial best-arm identification. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference on Learning Theory (COLT2018), volume 75 of Proceedings of Machine Learning Research, pages 918--949. PMLR, 2018. [ bib \| http \| .pdf ] We study bandit best-arm identification with arbitrary and potentially adversarial rewards. A simple random uniform learner obtains the optimal rate of error in the adversarial scenario. However, this type of strategy is suboptimal when the rewards are sampled stochastically. Therefore, we ask: Can we design a learner that performs optimally in both the stochastic and adversarial problems while not being aware of the nature of the rewards? First, we show that designing such a learner is impossible in general. In particular, to be robust to adversarial rewards, we can only guarantee optimal rates of error on a subset of the stochastic problems. We give a lower bound that characterizes the optimal rate in stochastic problems if the strategy is constrained to be robust to adversarial rewards. Finally, we design a simple parameter-free algorithm and show that its probability of error matches (up to log factors) the lower bound in stochastic problems, and it is also robust to adversarial ones.
[228]	Kush Bhatia, Aldo Pacchiano, Nicolas Flammarion, Peter L. Bartlett, and Michael I. Jordan. Gen-Oja: Simple and efficient algorithm for streaming generalized eigenvector computation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7016--7025. Curran Associates, Inc., 2018. [ bib \| .pdf ] In this paper, we study the problems of principle Generalized Eigenvector computation and Canonical Correlation Analysis in the stochastic setting. We propose a simple and efficient algorithm for these problems. We prove the global convergence of our algorithm, borrowing ideas from the theory of fast-mixing Markov chains and two-Time-Scale Stochastic Approximation, showing that it achieves the optimal rate of convergence. In the process, we develop tools for understanding stochastic processes with Markovian noise which might be of independent interest.
[229]	Xiang Cheng, Niladri S. Chatterji, Yasin Abbasi-Yadkori, Peter L. Bartlett, and Michael I. Jordan. Sharp convergence rates for Langevin dynamics in the nonconvex setting. Technical Report arXiv:1805.01648 [stat.ML], arxiv.org, 2018. [ bib \| http ]
[230]	Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1--17, 2019. [ bib \| .html ]
[231]	Peter L. Bartlett, Victor Gabillon, and Michal Valko. A simple parameter-free and adaptive approach to optimization under a minimal local smoothness assumption. In Aurélien Garivier and Satyen Kale, editors, Proceedings of the 30th International Conference on Algorithmic Learning Theory, volume 98 of Proceedings of Machine Learning Research, pages 184--206, Chicago, Illinois, 2019. PMLR. [ bib \| .html \| .pdf ] We study the problem of optimizing a function under a budgeted number of evaluations. We only assume that the function is locally smooth around one of its global optima. The difficulty of optimization is measured in terms of 1) the amount of noise b of the function evaluation and 2) the local smoothness, d, of the function. A smaller d results in smaller optimization error. We come with a new, simple, and parameter-free approach. First, for all values of b and d, this approach recovers at least the state-of-the-art regret guarantees. Second, our approach additionally obtains these results while being agnostic to the values of both b and d. This leads to the first algorithm that naturally adapts to an unknown range of noise b and leads to significant improvements in a moderate and low-noise regime. Third, our approach also obtains a remarkable improvement over the state-of-the-art SOO algorithm when the noise is very low which includes the case of optimization under deterministic feedback (b=0). There, under our minimal local smoothness assumption, this improvement is of exponential magnitude and holds for a class of functions that covers the vast majority of functions that practitioners optimize (d=0). We show that our algorithmic improvement is borne out in experiments as we empirically show faster convergence on common benchmarks.
[232]	Dhruv Malik, Ashwin Pananjady, Kush Bhatia, Koulik Khamaru, Peter L. Bartlett, and Martin J. Wainwright. Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), volume 89 of Proceedings of Machine Learning Research, pages 2916--2925. PMLR, 2019. [ bib \| .html \| .pdf ] We study derivative-free methods for policy optimization over the class of linear policies. We focus on characterizing the convergence rate of a canonical stochastic, two-point, derivative-free method for linear-quadratic systems in which the initial state of the system is drawn at random. In particular, we show that for problems with effective dimension D, such a method converges to an ε-approximate solution within O(D/ε) steps, with multiplicative pre-factors that are explicit lower-order polynomial terms in the curvature parameters of the problem. Along the way, we also derive stochastic zero-order rates for a class of non-convex optimization problems.
[233]	Vidya Muthukumar, Mitas Ray, Anant Sahai, and Peter L. Bartlett. Best of many worlds: Robust model selection for online supervised learning. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), volume 89 of Proceedings of Machine Learning Research, pages 3177--3186. PMLR, 2019. [ bib \| .html \| .pdf ] We introduce algorithms for online, full-information prediction that are computationally efficient and competitive with contextual tree experts of unknown complexity, in both probabilistic and adversarial settings. We incorporate a novel probabilistic framework of structural risk minimization into existing adaptive algorithms and show that we can robustly learn not only the presence of stochastic structure when it exists, but also the correct model order. When the stochastic data is actually realized from a predictor in the model class considered, we obtain regret bounds that are competitive with the regret of an optimal algorithm that possesses strong side information about both the true model order and whether the process generating the data is stochastic or adversarial. In cases where the data does not arise from any of the models, our algorithm selects models of higher order as we play more rounds. We display empirically improved overall prediction error over other adversarially robust approaches.
[234]	Niladri Chatterji, Aldo Pacchiano, and Peter Bartlett. Online learning with kernel losses. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 971--980, Long Beach, California, USA, 2019. PMLR. [ bib \| .html \| .pdf ] We present a generalization of the adversarial linear bandits framework, where the underlying losses are kernel functions (with an associated reproducing kernel Hilbert space) rather than linear functions. We study a version of the exponential weights algorithm and bound its regret in this setting. Under conditions on the eigen-decay of the kernel we provide a sharp characterization of the regret for this algorithm. When we have polynomial eigen-decay (μ_j <=O(j^-β)), we find that the regret is bounded by R_n <= O(n^β/2(β-1)). While under the assumption of exponential eigen-decay (μ_j <=O(e^-βj )) we get an even tighter bound on the regret R_n <= \mathcal{O}̃(n^1/2). When the eigen-decay is polynomial we also show a non-matching minimax lower bound on the regret of R_n >=Ω(n^(β+1)/2β) and a lower bound of R_n >=Ω(n^1/2) when the decay in the eigen-values is exponentially fast. We also study the full information setting when the underlying losses are kernel functions and present an adapted exponential weights algorithm and a conditional gradient descent algorithm.
[235]	Peter L. Bartlett, Victor Gabillon, Jennifer Healey, and Michal Valko. Scale-free adaptive planning for deterministic dynamics and discounted rewards. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 495--504, Long Beach, California, USA, 2019. PMLR. [ bib \| .html \| .pdf ] We address the problem of planning in an environment with deterministic dynamics and stochastic discounted rewards under a limited numerical budget where the ranges of both rewards and noise are unknown. We introduce PlaTypOOS, an adaptive, robust, and efficient alternative to the OLOP (open-loop optimistic planning) algorithm. Whereas OLOP requires a priori knowledge of the ranges of both rewards and noise, PlaTypOOS dynamically adapts its behavior to both. This allows PlaTypOOS to be immune to two vulnerabilities of OLOP: failure when given underestimated ranges of noise and rewards and inefficiency when these are overestimated. PlaTypOOS additionally adapts to the global smoothness of the value function. PlaTypOOS acts in a provably more efficient manner vs. OLOP when OLOP is given an overestimated reward and show that in the case of no noise, PlaTypOOS learns exponentially faster.
[236]	Yasin Abbasi-Yadkori, Peter L. Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and Gellert Weisz. POLITEX: Regret bounds for policy iteration using expert prediction. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3692--3702, Long Beach, California, USA, 2019. PMLR. [ bib \| .html \| .pdf ] We present POLITEX (POLicy ITeration with EXpert advice), a variant of policy iteration where each policy is a Boltzmann distribution over the sum of action-value function estimates of the previous policies, and analyze its regret in continuing RL problems. We assume that the value function error after running a policy for τ time steps scales as ε(τ) = ε_0 + O(sqrt(d/τ)), where ε_0 is the worst-case approximation error and d is the number of features in a compressed representation of the state-action space. We establish that this condition is satisfied by the LSPE algorithm under certain assumptions on the MDP and policies. Under the error assumption, we show that the regret of POLITEX in uniformly mixing MDPs scales as O(d^1/2T^3/4 + ε_0T), where O(·) hides logarithmic terms and problem-dependent constants. Thus, we provide the first regret bound for a fully practical model-free method which only scales in the number of features, and not in the size of the underlying MDP. Experiments on a queuing problem confirm that POLITEX is competitive with some of its alternatives, while preliminary results on Ms Pacman (one of the standard Atari benchmark problems) confirm the viability of POLITEX beyond linear function approximation.
[237]	Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter L. Bartlett. Defending against saddle point attack in Byzantine-robust distributed learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7074--7084, Long Beach, California, USA, 2019. PMLR. [ bib \| .html \| .pdf ] We study robust distributed learning that involves minimizing a non-convex loss function with saddle points. We consider the Byzantine setting where some worker machines have abnormal or even arbitrary and adversarial behavior, and in this setting, the Byzantine machines may create fake local minima near a saddle point that is far away from any true local minimum, even when robust gradient estimators are used. We develop ByzantinePGD, a robust first-order algorithm that can provably escape saddle points and fake local minima, and converge to an approximate true local minimizer with low iteration complexity. As a by-product, we give a simpler algorithm and analysis for escaping saddle points in the usual non-Byzantine setting. We further discuss three robust gradient estimators that can be used in ByzantinePGD, including median, trimmed mean, and iterative filtering. We characterize their performance in concrete statistical settings, and argue for their near-optimality in low and high dimensional regimes.
[238]	Dong Yin, Ramchandran Kannan, and Peter L. Bartlett. Rademacher complexity for adversarially robust generalization. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7085--7094, Long Beach, California, USA, 2019. PMLR. [ bib \| .html \| .pdf ] Many machine learning models are vulnerable to adversarial attacks; for example, adding adversarial perturbations that are imperceptible to humans can often make machine learning models produce wrong predictions with high confidence; moreover, although we may obtain robust models on the training dataset via adversarial training, in some problems the learned models cannot generalize well to the test data. In this paper, we focus on _ attacks, and study the adversarially robust generalization problem through the lens of Rademacher complexity. For binary linear classifiers, we prove tight bounds for the adversarial Rademacher complexity, and show that the adversarial Rademacher complexity is never smaller than its natural counterpart, and it has an unavoidable dimension dependence, unless the weight vector has bounded _1 norm, and our results also extend to multi-class linear classifiers; in addition, for (nonlinear) neural networks, we show that the dimension dependence in the adversarial Rademacher complexity also exists. We further consider a surrogate adversarial loss for one-hidden layer ReLU network and prove margin bounds for this setting. Our results indicate that having _1 norm constraints on the weight matrices might be a potential way to improve generalization in the adversarial setting. We demonstrate experimental results that validate our theoretical findings.
[239]	Kush Bhatia, Yi-An Ma, Anca D. Dragan, Peter L. Bartlett, and Michael I. Jordan. Bayesian robustness: A nonasymptotic viewpoint. Technical Report 1907.11826, arXiv, 2019. [ bib ]
[240]	Wenlong Mou, Nhat Ho, Martin J. Wainwright, Peter L. Bartlett, and Michael I. Jordan. Sampling for Bayesian mixture models: MCMC with polynomial-time mixing. Technical Report 1912.05153, arXiv, 2019. [ bib ] We study the problem of sampling from the power posterior distribution in Bayesian Gaussian mixture models. The power posterior is a robust version of the classical posterior; it is known to be non-log-concave and multi-modal, which leads to exponential mixing times for some standard MCMC algorithms. We introduce and study the Reflected Metropolis-Hastings Random Walk (RMRW) algorithm for sampling from the power posterior distribution of two-component location Gaussian mixtures. For symmetric mixtures with mean parameters at −θ_0 and θ_0, we prove that its mixing time is bounded as d^1.5(d + θ_0^2)^4.5 as long as the sample size n is of the order d(d + θ_0^2). Notably, this result requires no conditions on the separation of the two means θ_0 and -θ_0. En route to proving this bound, we establish some new results, of possible independent interest, that allow for combining Poincare inequalities for conditional and marginal densities.
[241]	Yeshwanth Cherapanamjeri, Nicolas Flammarion, and Peter L. Bartlett. Fast mean estimation with sub-gaussian rates. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 786--806. PMLR, 2019. [ bib \| .html \| .pdf ] We propose an estimator for the mean of a random vector in R^d that can be computed in time O(n^3.5+n^2d) for n i.i.d. samples and that has error bounds matching the sub-Gaussian case. The only assumptions we make about the data distribution are that it has finite mean and covariance; in particular, we make no assumptions about higher-order moments. Like the polynomial time estimator introduced by Hopkins (2018), which is based on the sum-of-squares hierarchy, our estimator achieves optimal statistical efficiency in this challenging setting, but it has a significantly faster runtime and a simpler analysis.
[242]	Yeshwanth Cherapanamjeri and Peter L. Bartlett. Testing Markov chains without hitting. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the 32nd Conference on Learning Theory (COLT2019), volume 99 of Proceedings of Machine Learning Research, pages 758--785. PMLR, 2019. [ bib \| .html \| .pdf ] We study the problem of identity testing of Markov chains. In this setting, we are given access to a single trajectory from a Markov chain with unknown transition matrix Q and the goal is to determine whether Q = P for some known matrix P or Dist(P , Q)>=ε, where Dist is suitably defined. In recent work by Daskalakis et al. (2018a), it was shown that it is possible to distinguish between the two cases provided the length of the observed trajectory is at least super-linear in the hitting time of P, which may be arbitrarily large. In this paper, we propose an algorithm that avoids this dependence on hitting time, thus enabling efficient testing of Markov chains even in cases where it is infeasible to observe every state in the chain. Our algorithm is based on combining classical ideas from approximation algorithms with techniques for the spectral analysis of Markov chains.
[243]	Peter L. Bartlett, David P. Helmbold, and Philip M. Long. Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks. Neural Computation, 31:477--502, 2019. [ bib ]
[244]	Wenlong Mou, Nicolas Flammarion, Martin J. Wainwright, and Peter L. Bartlett. An efficient sampling algorithm for non-smooth composite potentials. Technical Report 1910.00551, arXiv, 2019. [ bib ]
[245]	Alexander Tsigler and Peter L. Bartlett. Benign overfitting in ridge regression. Technical Report arXiv:2009.14286, arxiv.org, 2020. [ bib \| http ] Classical learning theory suggests that strong regularization is needed to learn a class with large complexity. This intuition is in contrast with the modern practice of machine learning, in particular learning neural networks, where the number of parameters often exceeds the number of data points. It has been observed empirically that such overparametrized models can show good generalization performance even if trained with vanishing or negative regularization. The aim of this work is to understand theoretically how this effect can occur, by studying the setting of ridge regression. We provide non-asymptotic generalization bounds for overparametrized ridge regression that depend on the arbitrary covariance structure of the data, and show that those bounds are tight for a range of regularization parameter values. To our knowledge this is the first work that studies overparametrized ridge regression in such a general setting. We identify when small or negative regularization is sufficient for obtaining small generalization error. On the technical side, our bounds only require the data vectors to be i.i.d. sub-gaussian, while most previous work assumes independence of the components of those vectors.
[246]	Kush Bhatia, Ashwin Pananjady, Peter L. Bartlett, Anca Dragan, and Martin Wainwright. Preference learning along multiple criteria: A game-theoretic perspective. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 7413--7424. Curran Associates, Inc., 2020. [ bib \| .pdf ] The literature on ranking from ordinal data is vast, and there are several ways to aggregate overall preferences from pairwise comparisons between objects. In particular, it is well-known that any Nash equilibrium of the zero-sum game induced by the preference matrix defines a natural solution concept (winning distribution over objects) known as a von Neumann winner. Many real-world problems, however, are inevitably multi-criteria, with different pairwise preferences governing the different criteria. In this work, we generalize the notion of a von Neumann winner to the multi-criteria setting by taking inspiration from Blackwell’s approachability. Our framework allows for non-linear aggregation of preferences across criteria, and generalizes the linearization-based approach from multi-objective optimization. From a theoretical standpoint, we show that the Blackwell winner of a multi-criteria problem instance can be computed as the solution to a convex optimization problem. Furthermore, given random samples of pairwise comparisons, we show that a simple, "plug-in" estimator achieves (near-)optimal minimax sample complexity. Finally, we showcase the practical utility of our framework in a user study on autonomous driving, where we find that the Blackwell winner outperforms the von Neumann winner for the overall preferences.
[247]	Dhruv Malik, Ashwin Pananjady, Kush Bhatia, Koulik Khamaru, Peter L. Bartlett, and Martin J. Wainwright. Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. Journal of Machine Learning Research, 21(21):1--51, 2020. [ bib \| .html ]
[248]	Niladri Chatterji, Vidya Muthukumar, and Peter L. Bartlett. OSOM: A simultaneously optimal algorithm for multi-armed and linear contextual bandits. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 1844--1854. PMLR, 26--28 Aug 2020. [ bib \| .html \| .pdf ] We consider the stochastic linear (multi-armed) contextual bandit problem with the possibility of hidden simple multi-armed bandit structure in which the rewards are independent of the contextual information. Algorithms that are designed solely for one of the regimes are known to be sub-optimal for their alternate regime. We design a single computationally efficient algorithm that simultaneously obtains problem-dependent optimal regret rates in the simple multi-armed bandit regime and minimax optimal regret rates in the linear contextual bandit regime, without knowing a priori which of the two models generates the rewards. These results are proved under the condition of stochasticity of contextual information over multiple rounds. Our results should be viewed as a step towards principled data-dependent policy class selection for contextual bandits.
[249]	Niladri Chatterji, Jelena Diakonikolas, Michael I. Jordan, and Peter L. Bartlett. Langevin Monte Carlo without smoothness. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 1716--1726. PMLR, 26--28 Aug 2020. [ bib \| .html \| .pdf ] Langevin Monte Carlo (LMC) is an iterative algorithm used to generate samples from a distribution that is known only up to a normalizing constant. The nonasymptotic dependence of its mixing time on the dimension and target accuracy is understood mainly in the setting of smooth (gradient-Lipschitz) log-densities, a serious limitation for applications in machine learning. In this paper, we remove this limitation, providing polynomial-time convergence guarantees for a variant of LMC in the setting of nonsmooth log-concave distributions. At a high level, our results follow by leveraging the implicit smoothing of the log-density that comes from a small Gaussian perturbation that we add to the iterates of the algorithm and controlling the bias and variance that are induced by this perturbation.
[250]	Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063--30070, 2020. (arXiv:1906.11300). [ bib \| DOI \| arXiv \| http ] The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of the effective rank of the data covariance. It shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size. By studying examples of data covariance properties that this characterization shows are required for benign overfitting, we find an important role for finite-dimensional data: the accuracy of the minimum norm interpolating prediction rule approaches the best possible accuracy for a much narrower range of properties of the data distribution when the data lie in an infinite-dimensional space vs. when the data lie in a finite-dimensional space with dimension that grows faster than the sample size.
[251]	Eric Mazumdar, Aldo Pacchiano, Yian Ma, Michael Jordan, and Peter Bartlett. On approximate Thompson sampling with Langevin algorithms. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 6797--6807. PMLR, 13--18 Jul 2020. [ bib \| .html \| .pdf ] Thompson sampling for multi-armed bandit problems is known to enjoy favorable performance in both theory and practice. However, its wider deployment is restricted due to a significant computational limitation: the need for samples from posterior distributions at every iteration. In practice, this limitation is alleviated by making use of approximate sampling methods, yet provably incorporating approximate samples into Thompson Sampling algorithms remains an open problem. In this work we address this by proposing two efficient Langevin MCMC algorithms tailored to Thompson sampling. The resulting approximate Thompson Sampling algorithms are efficiently implementable and provably achieve optimal instance-dependent regret for the Multi-Armed Bandit (MAB) problem. To prove these results we derive novel posterior concentration bounds and MCMC convergence rates for log-concave distributions which may be of independent interest.
[252]	Jonathan Lee, Aldo Pacchiano, Peter L. Bartlett, and Michael I. Jordan. Accelerated message passing for entropy-regularized MAP inference. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5736--5746. PMLR, 13--18 Jul 2020. [ bib \| .html \| .pdf ] Maximum a posteriori (MAP) inference in discrete-valued Markov random fields is a fundamental problem in machine learning that involves identifying the most likely configuration of random variables given a distribution. Due to the difficulty of this combinatorial problem, linear programming (LP) relaxations are commonly used to derive specialized message passing algorithms that are often interpreted as coordinate descent on the dual LP. To achieve more desirable computational properties, a number of methods regularize the LP with an entropy term, leading to a class of smooth message passing algorithms with convergence guarantees. In this paper, we present randomized methods for accelerating these algorithms by leveraging techniques that underlie classical accelerated gradient methods. The proposed algorithms incorporate the familiar steps of standard smooth message passing algorithms, which can be viewed as coordinate minimization steps. We show that these accelerated variants achieve faster rates for finding ε-optimal points of the unregularized problem, and, when the LP is tight, we prove that the proposed algorithms recover the true MAP solution in fewer iterations than standard message passing algorithms.
[253]	Xiang Cheng, Dong Yin, Peter Bartlett, and Michael Jordan. Stochastic gradient and Langevin processes. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1810--1819. PMLR, 13--18 Jul 2020. [ bib \| .html \| .pdf ] We prove quantitative convergence rates at which discrete Langevin-like processes converge to the invariant distribution of a related stochastic differential equation. We study the setup where the additive noise can be non-Gaussian and state-dependent and the potential function can be non-convex. We show that the key properties of these processes depend on the potential function and the second moment of the additive noise. We apply our theoretical findings to studying the convergence of Stochastic Gradient Descent (SGD) for non-convex problems and corroborate them with experiments using SGD to train deep neural networks on the CIFAR-10 dataset.
[254]	Hossein Mobahi, Mehrdad Farajtabar, and Peter L. Bartlett. Self-distillation amplifies regularization in Hilbert space. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems 33, volume 33, pages 3351--3361. Curran Associates, Inc., 2020. [ bib \| .pdf ] Knowledge distillation introduced in the deep learning context is a method to transfer knowledge from one architecture to another. In particular, when the architectures are identical, this is called self-distillation. The idea is to feed in predictions of the trained model as new target values for retraining (and iterate this loop possibly a few times). It has been empirically observed that the self-distilled model often achieves higher accuracy on held out data. Why this happens, however, has been a mystery: the self-distillation dynamics does not receive any new information about the task and solely evolves by looping over training. To the best of our knowledge, there is no rigorous understanding of why this happens. This work provides the first theoretical analysis of self-distillation. We focus on fitting a nonlinear function to training data, where the model space is Hilbert space and fitting is subject to L2 regularization in this function space. We show that self-distillation iterations modify regularization by progressively limiting the number of basis functions that can be used to represent the solution. This implies (as we also verify empirically) that while a few rounds of self-distillation may reduce over-fitting, further rounds may lead to under-fitting and thus worse performance.
[255]	Yeshwanth Cherapanamjeri, Efe Aras, Nilesh Tripuraneni, Michael I. Jordan, Nicolas Flammarion, and Peter L. Bartlett. Optimal robust linear regression in nearly linear time. Technical Report 2007.08137, arXiv, 2020. [ bib ]
[256]	Wenlong Mou, Chris Junchi Li, Martin J Wainwright, Peter L Bartlett, and Michael I Jordan. On linear stochastic approximation: Fine-grained Polyak-Ruppert and non-asymptotic concentration. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 2947--2997. PMLR, 2020. [ bib \| .html \| .pdf ] We undertake a precise study of the asymptotic and non-asymptotic properties of stochastic approximation procedures with Polyak-Ruppert averaging for solving a linear system Ā θ= b̄. When the matrix Ā is Hurwitz, we prove a central limit theorem (CLT) for the averaged iterates with fixed step size and number of iterations going to infinity. The CLT characterizes the exact asymptotic covariance matrix, which is the sum of the classical Polyak-Ruppert covariance and a correction term that scales with the step size. Under assumptions on the tail of the noise distribution, we prove a non-asymptotic concentration inequality whose main term matches the covariance in CLT in any direction, up to universal constants. When the matrix Ā is not Hurwitz but only has non-negative real parts in its eigenvalues, we prove that the averaged LSA procedure actually achieves an O(1/T) rate in mean-squared error. Our results provide a more refined understanding of linear stochastic approximation in both the asymptotic and non-asymptotic settings. We also show various applications of the main results, including the study of momentum-based stochastic gradient methods as well as temporal difference algorithms in reinforcement learning.
[257]	Peter L. Bartlett and Philip M. Long. Failures of model-dependent generalization bounds for least-norm interpolation. Technical Report arXiv:2010.08479, arxiv.org, 2020. [ bib \| http ] We consider bounds on the generalization performance of the least-norm linear regressor, in the over-parameterized regime where it can interpolate the data. We describe a sense in which any generalization bound of a type that is commonly proved in statistical learning theory must sometimes be very loose when applied to analyze the least-norm interpolant. In particular, for a variety of natural joint distributions on training examples, any valid generalization bound that depends only on the output of the learning algorithm, the number of training examples, and the confidence parameter, and that satisfies a mild condition (substantially weaker than monotonicity in sample size), must sometimes be very loose---it can be bounded below by a constant when the true excess risk goes to zero.
[258]	Yeshwanth Cherapanamjeri, Nilesh Tripuraneni, Peter L. Bartlett, and Michael I. Jordan. Optimal mean estimation without a variance. Technical Report 2011.12433, arXiv, 2020. [ bib ]
[259]	Peter L. Bartlett, Andrea Montanari, and Alexander Rakhlin. Deep learning: a statistical viewpoint. Acta Numerica, 30:87–201, 2021. [ bib \| DOI \| http ] The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting, that is, accurate predictions despite overfitting training data. We survey recent theoretical progress that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behavior of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favorable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.
[260]	Peter L. Bartlett and Philip M. Long. Failures of model-dependent generalization bounds for least-norm interpolation. Journal of Machine Learning Research, 22(204):1--15, 2021. arXiv:2010.08479. [ bib \| .html ]
[261]	Wenlong Mou, Yi-An Ma, Martin J. Wainwright, Peter L. Bartlett, and Michael I. Jordan. High-order Langevin diffusion yields an accelerated MCMC algorithm. Journal of Machine Learning Research, 22(42):1--41, 2021. [ bib \| .html ]
[262]	Yi-An Ma, Niladri S. Chatterji, Xiang Cheng, Nicolas Flammarion, Peter L. Bartlett, and Michael I. Jordan. Is there an analog of Nesterov acceleration for gradient-based MCMC? Bernoulli, 27(3):1942--1992, 2021. [ bib \| DOI ] We formulate gradient-based Markov chain Monte Carlo (MCMC) sampling as optimization on the space of probability measures, with Kullback–Leibler (KL) divergence as the objective functional. We show that an underdamped form of the Langevin algorithm performs accelerated gradient descent in this metric. To characterize the convergence of the algorithm, we construct a Lyapunov functional and exploit hypocoercivity of the underdamped Langevin algorithm. As an application, we show that accelerated rates can be obtained for a class of nonconvex functions with the Langevin algorithm.
[263]	Kush Bhatia, Peter L. Bartlett, Anca D. Dragan, and Jacob Steinhardt. Agnostic Learning with Unknown Utilities. In James R. Lee, editor, 12th Innovations in Theoretical Computer Science Conference (ITCS 2021), volume 185 of Leibniz International Proceedings in Informatics (LIPIcs), pages 55:1--55:20, Dagstuhl, Germany, 2021. Schloss Dagstuhl--Leibniz-Zentrum für Informatik. [ bib \| DOI \| http ]
[264]	Aldo Pacchiano, Mohammad Ghavamzadeh, Peter L. Bartlett, and Heinrich Jiang. Stochastic bandits with linear constraints. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 2827--2835. PMLR, 13--15 Apr 2021. [ bib \| .html \| .pdf ] We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies, whose expected cumulative reward over the course of multiple rounds is maximum, and each one of them has an expected cost below a certain threshold. We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB), and prove a sublinear bound on its regret that is inversely proportional to the difference between the constraint threshold and the cost of a known feasible action. Our algorithm balances exploration and constraint satisfaction using a novel idea that scales the radii of the reward and cost confidence sets with different scaling factors. We further specialize our results to multi-armed bandits and propose a computationally efficient algorithm for this setting and prove a a regret bound that is better than simply casting multi-armed bandits as an instance of linear bandits and using the regret bound of OPLB. We also prove a lower-bound for the problem studied in the paper and provide simulations to validate our theoretical results. Finally, we show how our algorithm and analysis can be extended to multiple constraints and to the case when the cost of the feasible action is unknown.
[265]	Raman Arora, Peter L. Bartlett, Poorya Mianjy, and Nathan Srebro. Dropout: Explicit forms and capacity control. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 351--361. PMLR, 18--24 Jul 2021. [ bib \| .html \| .pdf ] We investigate the capacity control provided by dropout in various machine learning problems. First, we study dropout for matrix completion, where it induces a distribution-dependent regularizer that equals the weighted trace-norm of the product of the factors. In deep learning, we show that the distribution-dependent regularizer due to dropout directly controls the Rademacher complexity of the underlying class of deep neural networks. These developments enable us to give concrete generalization error bounds for the dropout algorithm in both matrix completion as well as training deep neural networks.
[266]	Niladri S. Chatterji, Philip M. Long, and Peter L. Bartlett. When does gradient descent with logistic loss interpolate using deep networks with smoothed ReLU activations? In Mikhail Belkin and Samory Kpotufe, editors, Proceedings of the 34th Conference on Learning Theory (COLT2021), volume 134 of Proceedings of Machine Learning Research, pages 927--1027, 2021. [ bib \| .html ] We establish conditions under which gradient descent applied to fixed-width deep networks drives the logistic loss to zero, and prove bounds on the rate of convergence. Our analysis applies for smoothed approximations to the ReLU, such as Swish and the Huberized ReLU, proposed in previous applied work. We provide two sufficient conditions for convergence. The first is simply a bound on the loss at initialization. The second is a data separation condition used in prior analyses.
[267]	Juan Perdomo, Max Simchowitz, Alekh Agarwal, and Peter L. Bartlett. Towards a dimension-free understanding of adaptive linear control. In Proceedings of the 34th Conference on Learning Theory (COLT2021), 2021. [ bib ] We study the problem of adaptive control of the linear quadratic regulator for systems in very high, or even infinite dimension. We demonstrate that while sublinear regret requires finite dimensional inputs, the ambient state dimension of the system need not be small in order to perform online control. We provide the first regret bounds for LQR which hold for infinite dimensional systems, replacing dependence on ambient dimension with more natural notions of problem complexity. Our guarantees arise from a novel perturbation bound for certainty equivalence which scales with the prediction error in estimating the system parameters, without requiring consistent parameter recovery in more stringent measures like the operator norm. When specialized to finite dimensional settings, our bounds recover near optimal dimension and time horizon dependence.
[268]	Peter L. Bartlett, Sebastien Bubeck, and Yeshwanth Cherapanamjeri. Adversarial examples in multi-layer random ReLU networks. Technical Report 2106.12611, arXiv, 2021. [ bib ] We consider the phenomenon of adversarial examples in ReLU networks with independent gaussian parameters. For networks of constant depth and with a large range of widths (for instance, it suffices if the width of each layer is polynomial in that of any other layer), small perturbations of input vectors lead to large changes of outputs. This generalizes results of Daniely and Schacham (2020) for networks of rapidly decreasing width and of Bubeck et al (2021) for two-layer networks. The proof shows that adversarial examples arise in these networks because the functions that they compute are very close to linear. Bottleneck layers in the network play a key role: the minimal width up to some point in the network determines scales and sensitivities of mappings computed up to that point. The main result is for networks with constant depth, but we also show that some constraint on depth is necessary for a result of this kind, because there are suitably deep networks that, with constant probability, compute a function that is close to constant.
[269]	Peter L. Bartlett, Sebastien Bubeck, and Yeshwanth Cherapanamjeri. Adversarial examples in multi-layer random ReLU networks. In Advances in Neural Information Processing Systems 34, 2021. [ bib ] We consider the phenomenon of adversarial examples in ReLU networks with independent gaussian parameters. For networks of constant depth and with a large range of widths (for instance, it suffices if the width of each layer is polynomial in that of any other layer), small perturbations of input vectors lead to large changes of outputs. This generalizes results of Daniely and Schacham (2020) for networks of rapidly decreasing width and of Bubeck et al (2021) for two-layer networks. The proof shows that adversarial examples arise in these networks because the functions that they compute are very close to linear. Bottleneck layers in the network play a key role: the minimal width up to some point in the network determines scales and sensitivities of mappings computed up to that point. The main result is for networks with constant depth, but we also show that some constraint on depth is necessary for a result of this kind, because there are suitably deep networks that, with constant probability, compute a function that is close to constant.
[270]	Aldo Pacchiano, Jonathan Lee, Peter L. Bartlett, and Ofir Nachum. Near optimal policy optimization via reps. In Advances in Neural Information Processing Systems 34, 2021. [ bib ] Since its introduction a decade ago, relative entropy policy search (REPS) has demonstrated successful policy learning on a number of simulated and real-world robotic domains, not to mention providing algorithmic components used by many recently proposed reinforcement learning (RL) algorithms. While REPS is commonly known in the community, there exist no guarantees on its performance when using stochastic and gradient-based solvers. In this paper we aim to fill this gap by providing guarantees and convergence rates for the sub-optimality of a policy learned using first-order optimization methods applied to the REPS objective. We first consider the setting in which we are given access to exact gradients and demonstrate how near-optimality of the objective translates to near-optimality of the policy. We then consider the practical setting of stochastic gradients, and introduce a technique that uses generative access to the underlying Markov decision process to compute parameter updates that maintain favorable convergence to the optimal regularized policy.
[271]	Niladri Chatterji, Aldo Pacchiano, Peter L. Bartlett, and Michael I. Jordan. On the theory of reinforcement learning with once-per-episode feedback. In Advances in Neural Information Processing Systems 34, 2021. [ bib ] We introduce a theory of reinforcement learning (RL) in which the learner receives feedback only once at the end of an episode. While this is an extreme test case for theory, it is also arguably more representative of real-world applications than the traditional requirement in RL practice that the learner receive feedback at every time step. Indeed, in many real-world applications of reinforcement learning, such as self driving cars and robotics, it is easier to evaluate whether a learner's complete trajectory was either `good' or `bad,' but harder to provide a reward signal at each step. To show that learning is possible in this more challenging setting, we study the case where trajectory labels are generated by an unknown parametric model, and provide a statistically and computationally efficient algorithm that achieves sub-linear regret.
[272]	Niladri S. Chatterji, Philip M. Long, and Peter L. Bartlett. When does gradient descent with logistic loss find interpolating two-layer networks? Journal of Machine Learning Research, 22(159):1--48, 2021. [ bib \| http ] We study the training of finite-width two-layer smoothed ReLU networks for binary classification using the logistic loss. We show that gradient descent drives the training loss to zero if the initial loss is small enough. When the data satisfies certain cluster and separation conditions and the network is wide enough, we show that one step of gradient descent reduces the loss sufficiently that the first result applies.
[273]	Wenlong Mou, Nicolas Flammarion, Martin J. Wainwright, and Peter L. Bartlett. An efficient sampling algorithm for non-smooth composite potentials. Journal of Machine Learning Research, 23(233):1--50, 2022. [ bib \| .html ]
[274]	Niladri S. Chatterji, Peter L. Bartlett, and Philip M. Long. Oracle lower bounds for stochastic gradient sampling algorithms. Bernoulli, 28(2):1074--1092, 2022. arXiv:2002.00291. [ bib \| http ] We consider the problem of sampling from a strongly log-concave density in R^d, and prove an information theoretic lower bound on the number of stochastic gradient queries of the log density needed. Several popular sampling algorithms (including many Markov chain Monte Carlo methods) operate by using stochastic gradients of the log density to generate a sample; our results establish an information theoretic limit for all these algorithms. We show that for every algorithm, there exists a well-conditioned strongly log-concave target density for which the distribution of points generated by the algorithm would be at least ε away from the target in total variation distance if the number of gradient queries is less than Omega(sigma²d/epsilon²), where sigma²d is the variance of the stochastic gradient. Our lower bound follows by combining the ideas of Le Cam deficiency routinely used in the comparison of statistical experiments along with standard information theoretic tools used in lower bounding Bayes risk functions. To the best of our knowledge our results provide the first nontrivial dimension-dependent lower bound for this problem.
[275]	Wenlong Mou, Nicolas Flammarion, Martin J. Wainwright, and Peter L. Bartlett. Improved bounds for discretization of Langevin diffusions: Near-optimal rates without convexity. Bernoulli, 28(3):1577--1601, 2022. [ bib \| DOI \| http ] We present an improved analysis of the Euler-Maruyama discretization of the Langevin diffusion. Our analysis does not require global contractivity, and yields polynomial dependence on the time horizon. Compared to existing approaches, we make an additional smoothness assumption, and improve the existing rate from O(eta) to O(eta²) in terms of the KL divergence. This result matches the correct order for numerical SDEs, without suffering from exponential time dependence. When applied to algorithms for sampling and learning, this result simultaneously improves all those methods based on Dalayan's approach.
[276]	Wenlong Mou, Ashwin Pananjady, Martin J. Wainwright, and Peter L. Bartlett. Optimal and instance-dependent guarantees for Markovian linear stochastic approximation. In Proceedings of the 35th Conference on Learning Theory (COLT2022), 2022. [ bib ] We study stochastic approximation procedures for approximately solving a d-dimensional linear fixed point equation based on observing a trajectory of length n from an ergodic Markov chain. We first exhibit a non-asymptotic bound of the order t_mix dn on the squared error of the last iterate of a standard scheme, where t_mix is a mixing time. We then prove a non-asymptotic instance-dependent bound on a suitably averaged sequence of iterates, with a leading term that matches the local asymptotic minimax limit, including sharp dependence on the parameters (d, t_mix) in the higher order terms. We complement these upper bounds with a non-asymptotic minimax lower bound that establishes the instance-optimality of the averaged SA estimator. We derive corollaries of these results for policy evaluation with Markov noise---covering the TD(λ) family of algorithms for all λin[0, 1)---and linear autoregressive models. Our instance-dependent characterizations open the door to the design of fine-grained model selection procedures for hyperparameter tuning (e.g., choosing the value of λ when running the TD(λ) algorithm).
[277]	Spencer Frei, Niladri Chatterji, and Peter L. Bartlett. Benign overfitting without linearity: Neural network classifiers trained by gradient descent for noisy linear data. In Proceedings of the 35th Conference on Learning Theory (COLT2022), 2022. [ bib ] Benign overfitting, the phenomenon where interpolating models generalize well in the presence of noisy data, was first observed in neural network models trained with gradient descent. To better understand this empirical observation, we consider the generalization error of two-layer neural networks trained to interpolation by gradient descent on the logistic loss following random initialization. We assume the data comes from well-separated class-conditional log-concave distributions and allow for a constant fraction of the training labels to be corrupted by an adversary. We show that in this setting, neural networks exhibit benign overfitting: they can be driven to zero training error, perfectly fitting any noisy training labels, and simultaneously achieve test error close to the Bayes-optimal error. In contrast to previous work on benign overfitting that require linear or kernel-based predictors, our analysis holds in a setting where both the model and learning dynamics are fundamentally nonlinear.
[278]	Peter L. Bartlett, Piotr Indyk, and Tal Wagner. Generalization bounds for data-driven numerical linear algebra. In Proceedings of the 35th Conference on Learning Theory (COLT2022), 2022. [ bib ] Data-driven algorithms can adapt their internal structure or parameters to inputs from unknown application-specific distributions, by learning from a training sample of inputs. Several recent works have applied this approach to problems in numerical linear algebra, obtaining significant empirical gains in performance. However, no theoretical explanation for their success was known. In this work we prove generalization bounds for those algorithms, within the PAC-learning framework for data-driven algorithm selection proposed by Gupta and Roughgarden (SICOMP 2017). Our main result is an almost tight bound on the fat shattering dimension of the learning-based low rank approximation algorithm of Indyk et al. (NeurIPS 2019). Our techniques are general, and provide generalization bounds for many other recently proposed data-driven algorithms in numerical linear algebra, covering both sketching-based and multigrid-based methods. This considerably broadens the class of data-driven algorithms for which a PAC-learning analysis is available.
[279]	Yeshwanth Cherapanamjeri, Nilesh Tripuraneni, Peter L. Bartlett, and Michael I. Jordan. Optimal mean estimation without a variance. In Proceedings of the 35th Conference on Learning Theory (COLT2022), 2022. [ bib ] We study the problem of heavy-tailed mean estimation in settings where the variance of the data-generating distribution does not exist. Concretely, given a sample X = {X_i}_i = 1^n from a distribution D over R^d with mean μ which satisfies the following weak-moment assumption for some αin[0, 1]: for allv = 1: E_X D[X - μv^1 + α] <=1, and given a target failure probability, δ, our goal is to design an estimator which attains the smallest possible confidence interval as a function of n,d,δ. For the specific case of α= 1, foundational work of Lugosi and Mendelson exhibits an estimator achieving subgaussian confidence intervals, and subsequent work has led to computationally efficient versions of this estimator. Here, we study the case of general α, and establish the following information-theoretic lower bound on the optimal attainable confidence interval: Ωsqrt((d)/(n)) + (d)/(n)^(α)/((1 + α)) + (log 1 / δ)/(n)^(α)/((1 + α)). Moreover, we devise a computationally-efficient estimator which achieves this lower bound.
[280]	Niladri S. Chatterji, Philip M. Long, and Peter L. Bartlett. The interplay between implicit bias and benign overfitting in two-layer linear networks. Journal of Machine Learning Research, 23(263):1--48, 2022. [ bib \| .html ]
[281]	Wenlong Mou, Martin J. Wainwright, and Peter L. Bartlett. Off-policy estimation of linear functionals: Non-asymptotic theory for semi-parametric efficiency. Technical Report 2209.13075, arXiv, 2022. [ bib ] The problem of estimating a linear functional based on observational data is canonical in both the causal inference and bandit literatures. We analyze a broad class of two-stage procedures that first estimate the treatment effect function, and then use this quantity to estimate the linear functional. We prove non-asymptotic upper bounds on the mean-squared error of such procedures: these bounds reveal that in order to obtain non-asymptotically optimal procedures, the error in estimating the treatment effect should be minimized in a certain weighted L2-norm. We analyze a two-stage procedure based on constrained regression in this weighted norm, and establish its instance-dependent optimality in finite samples via matching non-asymptotic local minimax lower bounds. These results show that the optimal non-asymptotic risk, in addition to depending on the asymptotically efficient variance, depends on the weighted norm distance between the true outcome function and its approximation by the richest function class supported by the sample size.
[282]	Peter L. Bartlett, Philip M. Long, and Olivier Bousquet. The dynamics of sharpness-aware minimization: Bouncing across ravines and drifting towards wide minima. Technical Report 2210.01513, arXiv, 2022. [ bib ] We consider Sharpness-Aware Minimization (SAM), a gradient-based optimization method for deep networks that has exhibited performance improvements on image and language prediction problems. We show that when SAM is applied with a convex quadratic objective, for most random initializations it converges to a cycle that oscillates between either side of the minimum in the direction with the largest curvature, and we provide bounds on the rate of convergence. In the non-quadratic case, we show that such oscillations effectively perform gradient descent, with a smaller step-size, on the spectral norm of the Hessian. In such cases, SAM’s update may be regarded as a third derivative—the derivative of the Hessian in the leading eigenvector direction—that encourages drift toward wider minima.
[283]	Alexander Tsigler and Peter L. Bartlett. Benign overfitting in ridge regression. Journal of Machine Learning Research, 24(123):1--76, 2023. (See also arXiv:2009.14286). [ bib \| .html ]
[284]	Spencer Frei, Niladri S. Chatterji, and Peter L. Bartlett. Random feature amplification: Feature learning and generalization in neural networks. Journal of Machine Learning Research, 24(303):1--49, 2023. [ bib \| .html ] In this work, we provide a characterization of the feature-learning process in two-layer ReLU networks trained by gradient descent on the logistic loss following random initialization. We consider data with binary labels that are generated by an XOR-like function of the input features. We permit a constant fraction of the training labels to be corrupted by an adversary. We show that, although linear classifiers are no better than random guessing for the distribution we consider, two-layer ReLU networks trained by gradient descent achieve generalization error close to the label noise rate. We develop a novel proof technique that shows that at initialization, the vast majority of neurons function as random features that are only weakly correlated with useful features, and the gradient descent dynamics `amplify' these weak, random features to strong, useful features.
[285]	Wenlong Mou, Peng Ding, Martin J. Wainwright, and Peter L. Bartlett. Kernel-based off-policy estimation without overlap: Instance optimality beyond semiparametric efficiency. Technical Report 2301.06240, arXiv, 2023. [ bib ] We study optimal procedures for estimating a linear functional based on observational data. In many problems of this kind, a widely used assumption is strict overlap, i.e., uniform boundedness of the importance ratio, which measures how well the observational data covers the directions of interest. When it is violated, the classical semi-parametric efficiency bound can easily become infinite, so that the instance-optimal risk depends on the function class used to model the regression function. For any convex and symmetric function class F, we derive a non-asymptotic local minimax bound on the mean-squared error in estimating a broad class of linear functionals. This lower bound refines the classical semi-parametric one, and makes connections to moduli of continuity in functional estimation. When F is a reproducing kernel Hilbert space, we prove that this lower bound can be achieved up to a constant factor by analyzing a computationally simple regression estimator. We apply our general results to various families of examples, thereby uncovering a spectrum of rates that interpolate between the classical theories of semi-parametric efficiency (with sqrt n-consistency) and the slower minimax rates associated with non-parametric function estimation.
[286]	Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro, and Wei Hu. Implicit bias in leaky ReLU networks trained on high-dimensional data. In Proceedings of ICLR 2023, 2023. To appear. [ bib ] The implicit biases of gradient-based optimization algorithms are conjectured to be a major factor in the success of modern deep learning. In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations when the training data are nearly-orthogonal, a common property of high-dimensional data. For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two. Moreover, this network is an _2-max-margin solution (in parameter space), and has a linear decision boundary that corresponds to an approximate-max-margin linear predictor. For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training. We provide experiments which suggest that a small initialization scale is important for finding low-rank neural networks with gradient descent.
[287]	Juan C. Perdomo, Akshay Krishnamurthy, Peter L. Bartlett, and Sham Kakade. A complete characterization of linear estimators for offline policy evaluation. Journal of Machine Learning Research, 24(284):1--50, 2023. [ bib \| .html ] Offline policy evaluation is a fundamental statistical problem in reinforcement learning that involves estimating the value function of some decision-making policy given data collected by a potentially different policy. In order to tackle problems with complex, high-dimensional observations, there has been significant interest from theoreticians and practitioners alike in understanding the possibility of function approximation in reinforcement learning. Despite significant study, a sharp characterization of when we might expect offline policy evaluation to be tractable, even in the simplest setting of linear function approximation, has so far remained elusive, with a surprising number of strong negative results recently appearing in the literature. In this work, we identify simple control-theoretic and linear-algebraic conditions that are necessary and sufficient for classical methods, in particular Fitted Q-iteration (FQI) and least squares temporal difference learning (LSTD), to succeed at offline policy evaluation. Using this characterization, we establish a precise hierarchy of regimes under which these estimators succeed. We prove that LSTD works under strictly weaker conditions than FQI. Furthermore, we establish that if a problem is not solvable via LSTD, then it cannot be solved by a broad class of linear estimators, even in the limit of infinite data. Taken together, our results provide a complete picture of the behavior of linear estimators for offline policy evaluation, unify previously disparate analyses of canonical algorithms, and provide significantly sharper notions of the underlying statistical complexity of offline policy evaluation.
[288]	Aldo Pacchiano, Peter L. Bartlett, and Michael I. Jordan. An instance-dependent analysis for the cooperative multi-player multi-armed bandit. In Shipra Agrawal and Francesco Orabona, editors, Proceedings of the 34th International Conference on Algorithmic Learning Theory, volume 201 of Proceedings of Machine Learning Research, pages 1166--1215. PMLR, 2023. [ bib \| .html \| .pdf ] We study the problem of information sharing and cooperation in Multi-Player Multi-Armed bandits. We propose the first algorithm that achieves logarithmic regret for this problem. Our results are based on two innovations. First, we show that a simple modification to a successive elimination strategy can be used to allow the players to estimate their suboptimality gaps, up to constant factors, in the absence of collisions. Second, we leverage the first result to design a communication protocol that successfully uses the small reward of collisions to coordinate among players, while preserving meaningful instance-dependent logarithmic regret guarantees.
[289]	Spencer Frei, Gal Vardi, Peter L. Bartlett, and Nathan Srebro. Benign overfitting in linear classifiers and leaky ReLU networks from KKT conditions for margin maximization. In Proceedings of the 36th Conference on Learning Theory (COLT2023), 2023. To Appear. Also arXiv:2303.01462. [ bib ] Linear classifiers and leaky ReLU networks trained by gradient flow on the logistic loss have an implicit bias towards solutions which satisfy the Karush-Kuhn-Tucker (KKT) conditions for margin maximization. In this work we establish a number of settings where the satisfaction of these KKT conditions implies benign overfitting in linear classifiers and in two-layer leaky ReLU networks: the estimators interpolate noisy training data and simultaneously generalize well to test data. The settings include variants of the noisy class-conditional Gaussians considered in previous work as well as new distributional settings where benign overfitting has not been previously observed. The key ingredient to our proof is the observation that when the training data is nearly-orthogonal, both linear classifiers and leaky ReLU networks satisfying the KKT conditions for their respective margin maximization problems behave like a nearly uniform average of the training examples.
[290]	Peter L. Bartlett, Philip M. Long, and Olivier Bousquet. The dynamics of sharpness-aware minimization: Bouncing across ravines and drifting towards wide minima. Journal of Machine Learning Research, 24(316):1--36, 2023. [ bib \| .html ] We consider Sharpness-Aware Minimization (SAM), a gradient-based optimization method for deep networks that has exhibited performance improvements on image and language prediction problems. We show that when SAM is applied with a convex quadratic objective, for most random initializations it converges to a cycle that oscillates between either side of the minimum in the direction with the largest curvature, and we provide bounds on the rate of convergence. In the non-quadratic case, we show that such oscillations effectively perform gradient descent, with a smaller step-size, on the spectral norm of the Hessian. In such cases, SAM’s update may be regarded as a third derivative—the derivative of the Hessian in the leading eigenvector direction—that encourages drift toward wider minima.
[291]	Spencer Frei, Gal Vardi, Peter L. Bartlett, and Nathan Srebro. The double-edged sword of implicit bias: Generalization vs. robustness in ReLU networks. Technical Report 2303.01456, arXiv, 2023. [ bib ] In this work, we study the implications of the implicit bias of gradient flow on generalization and adversarial robustness in ReLU networks. We focus on a setting where the data consists of clusters and the correlations between cluster means are small, and show that in two-layer ReLU networks gradient flow is biased towards solutions that generalize well, but are highly vulnerable to adversarial examples. Our results hold even in cases where the network has many more parameters than training examples. Despite the potential for harmful overfitting in such overparameterized settings, we prove that the implicit bias of gradient flow prevents it. However, the implicit bias also leads to non-robust solutions (susceptible to small adversarial l2-perturbations), even though robust networks that fit the data exist.
[292]	Ruiqi Zhang, Spencer Frei, and Peter L. Bartlett. Trained transformers learn linear models in-context. Technical Report 2306.09927, arXiv, 2023. [ bib ] Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL): Given a short prompt sequence of tokens from an unseen task, they can formulate relevant per-token and next-token predictions without any parameter updates. By embedding a sequence of labeled training data and unlabeled test data as a prompt, this allows for transformers to behave like supervised learning algorithms. Indeed, recent work has shown that when training transformer architectures over random instances of linear regression problems, these models’ predictions mimic those of ordinary least squares. Towards understanding the mechanisms underlying this phenomenon, we investigate the dynamics of ICL in transformers with a single linear self-attention layer trained by gradient flow on linear regression tasks. We show that despite non-convexity, gradient flow with a suitable random initialization finds a global minimum of the objective function. At this global minimum, when given a test prompt of labeled examples from a new prediction task, the transformer achieves prediction error competitive with the best linear predictor over the test prompt distribution. We additionally characterize the robustness of the trained transformer to a variety of distribution shifts and show that although a number of shifts are tolerated, shifts in the covariate distribution of the prompts are not. Motivated by this, we consider a generalized ICL setting where the covariate distributions can vary across prompts. We show that although gradient flow succeeds at finding a global minimum in this setting, the trained transformer is still brittle under mild covariate shifts
[293]	Philip M. Long and Peter L. Bartlett. Sharpness-aware minimization and the edge of stability. Technical Report 2309.12488, arXiv, 2023. [ bib ]
[294]	Spencer Frei, Gal Vardi, Peter L. Bartlett, and Nathan Srebro. The double-edged sword of implicit bias: Generalization vs. robustness in ReLU networks. In Advances in Neural Information Processing Systems 36, 2023. [ bib ]
[295]	Kush Bhatia, Yi-An Ma, Anca D. Dragan, Peter L. Bartlett, and Michael I. Jordan. Bayesian robustness: A nonasymptotic viewpoint. Journal of the American Statistical Association, 119(546):1112--1123, 2023. [ bib \| DOI ]
[296]	Peter L. Bartlett and Philip M. Long. Corrigendum to 'prediction, learning, uniform convergence, and scale-sensitive dimensions'. Journal of Computer and System Sciences, 140:103465, 2024. [ bib ]
[297]	Wenlong Mou, Nhat Ho, Martin Wainwright, Peter L. Bartlett, and Michael Jordan. A diffusion process perspective on posterior contraction rates for parameters. SIAM Journal on Mathematics of Data Science, 2024. (To appear). [ bib ]
[298]	Ruiqi Zhang, Spencer Frei, and Peter L. Bartlett. Trained transformers learn linear models in-context. Journal of Machine Learning Research, 25(49):1--55, 2024. [ bib \| .html ] Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL): Given a short prompt sequence of tokens from an unseen task, they can formulate relevant per-token and next-token predictions without any parameter updates. By embedding a sequence of labeled training data and unlabeled test data as a prompt, this allows for transformers to behave like supervised learning algorithms. Indeed, recent work has shown that when training transformer architectures over random instances of linear regression problems, these models’ predictions mimic those of ordinary least squares. Towards understanding the mechanisms underlying this phenomenon, we investigate the dynamics of ICL in transformers with a single linear self-attention layer trained by gradient flow on linear regression tasks. We show that despite non-convexity, gradient flow with a suitable random initialization finds a global minimum of the objective function. At this global minimum, when given a test prompt of labeled examples from a new prediction task, the transformer achieves prediction error competitive with the best linear predictor over the test prompt distribution. We additionally characterize the robustness of the trained transformer to a variety of distribution shifts and show that although a number of shifts are tolerated, shifts in the covariate distribution of the prompts are not. Motivated by this, we consider a generalized ICL setting where the covariate distributions can vary across prompts. We show that although gradient flow succeeds at finding a global minimum in this setting, the trained transformer is still brittle under mild covariate shifts
[299]	Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, and Peter L. Bartlett. How many pretraining tasks are needed for in-context learning of linear regression? In The Twelfth International Conference on Learning Representations, 2024. [ bib \| http ]
[300]	Saptarshi Chakraborty and Peter L. Bartlett. A statistical analysis of Wasserstein autoencoders for intrinsically low-dimensional data. In The Twelfth International Conference on Learning Representations, 2024. [ bib \| http ]
[301]	Pierre Marion, Anna Korba, Peter L. Bartlett, Mathieu Blondel, Valentin De Bortoli, Arnaud Doucet, Felipe Llinares-López, Courtney Paquette, and Quentin Berthet. Implicit diffusion: Efficient optimization through stochastic sampling. Technical Report 2402.05468, arXiv, 2024. [ bib ] We present a new algorithm to optimize distributions defined implicitly by parameterized stochastic diffusions. Doing so allows us to modify the outcome distribution of sampling processes by optimizing over their parameters. We introduce a general framework for first-order optimization of these processes, that performs jointly, in a single loop, optimization and sampling steps. This approach is inspired by recent advances in bilevel optimization and automatic implicit differentiation, leveraging the point of view of sampling as optimization over the space of probability distributions. We provide theoretical guarantees on the performance of our method, as well as experimental results demonstrating its effectiveness in real-world settings.
[302]	Jingfeng Wu, Peter L. Bartlett, Matus Telgarsky, and Bin Yu. Large stepsize gradient descent for logistic loss: Non-monotonicity of the loss improves optimization efficiency. In Proceedings of the 37th Conference on Learning Theory (COLT2024), 2024. (To appear). [ bib ] We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize η is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly --- in O(η) steps, and subsequently achieves an Õ(1 / (ηt) ) convergence rate after t additional steps. Our results imply that, given a budget of T steps, GD can achieve an accelerated loss of Õ(1/T^2) with an aggressive stepsize η:= Θ( T), without any use of momentum or variable stepsize schedulers. Our proof technique is versatile and also handles general classification loss functions (where exponential tails are needed for the Õ(1/T^2) acceleration), nonlinear predictors in the neural tangent kernel regime, and online stochastic gradient descent (SGD) with a large stepsize, under suitable separability conditions.

This file was generated by bibtex2html 1.99.