Peter Bartlett's Journal Papers and Book Chapters

[CBL22] Niladri S. Chatterji, Peter L. Bartlett, and Philip M. Long. Oracle lower bounds for stochastic gradient sampling algorithms. Bernoulli, 28(2):1074--1092, 2022. arXiv:2002.00291. [ bib | http ]
We consider the problem of sampling from a strongly log-concave density in R^d, and prove an information-theoretic lower bound on the number of stochastic gradient queries of the log density needed. Several popular sampling algorithms (including many Markov chain Monte Carlo methods) operate by using stochastic gradients of the log density to generate a sample; our results establish an information-theoretic limit for all these algorithms. We show that for every algorithm, there exists a well-conditioned strongly log-concave target density for which the distribution of points generated by the algorithm would be at least ε away from the target in total variation distance if the number of gradient queries is less than Ω(σ²d/ε²), where σ² is the variance of the stochastic gradient and d is the dimension. Our lower bound follows by combining the ideas of Le Cam deficiency routinely used in the comparison of statistical experiments with standard information-theoretic tools used in lower bounding Bayes risk functions. To the best of our knowledge, our results provide the first nontrivial dimension-dependent lower bound for this problem.

[MFWB22] Wenlong Mou, Nicolas Flammarion, Martin J. Wainwright, and Peter L. Bartlett. Improved bounds for discretization of Langevin diffusions: Near-optimal rates without convexity. Bernoulli, 28(3):1577--1601, 2022. [ bib | DOI | http ]
We present an improved analysis of the Euler-Maruyama discretization of the Langevin diffusion. Our analysis does not require global contractivity, and yields polynomial dependence on the time horizon. Compared to existing approaches, we make an additional smoothness assumption, and improve the existing rate from O(η) to O(η²) in terms of the KL divergence. This result matches the correct order for numerical SDEs, without suffering from exponential time dependence. When applied to algorithms for sampling and learning, this result simultaneously improves all those methods based on Dalalyan's approach.
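
A minimal sketch, assuming a generic potential U with gradient grad_U, of the Euler-Maruyama discretization of the overdamped Langevin diffusion that the analysis concerns; the step size, iteration count, and Gaussian example below are illustrative placeholders, not settings from the paper.

    import numpy as np

    def langevin_euler_maruyama(grad_U, x0, eta, n_steps, rng):
        # One-step rule: x_{k+1} = x_k - eta * grad U(x_k) + sqrt(2 * eta) * xi_k, with xi_k ~ N(0, I).
        x = np.array(x0, dtype=float)
        for _ in range(n_steps):
            x = x - eta * grad_U(x) + np.sqrt(2.0 * eta) * rng.standard_normal(x.shape)
        return x

    # Illustrative use: approximately sample from N(0, I) in d = 5, where U(x) = ||x||^2 / 2.
    rng = np.random.default_rng(0)
    sample = langevin_euler_maruyama(lambda x: x, np.zeros(5), eta=0.01, n_steps=5000, rng=rng)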

[CLB22] Niladri S. Chatterji, Philip M. Long, and Peter L. Bartlett. The interplay between implicit bias and benign overfitting in two-layer linear networks. Journal of Machine Learning Research, 2022. To appear. [ bib ]
[BMR21] Peter L. Bartlett, Andrea Montanari, and Alexander Rakhlin. Deep learning: a statistical viewpoint. Acta Numerica, 30:87--201, 2021. [ bib | DOI | http ]
The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting, that is, accurate predictions despite overfitting training data. We survey recent theoretical progress that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behavior of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favorable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.

[BL21] Peter L. Bartlett and Philip M. Long. Failures of model-dependent generalization bounds for least-norm interpolation. Journal of Machine Learning Research, 22(204):1--15, 2021. arXiv:2010.08479. [ bib | .html ]
[MMW+21] Wenlong Mou, Yi-An Ma, Martin J. Wainwright, Peter L. Bartlett, and Michael I. Jordan. High-order Langevin diffusion yields an accelerated MCMC algorithm. Journal of Machine Learning Research, 22(42):1--41, 2021. [ bib | .html ]
[MCC+21] Yi-An Ma, Niladri S. Chatterji, Xiang Cheng, Nicolas Flammarion, Peter L. Bartlett, and Michael I. Jordan. Is there an analog of Nesterov acceleration for gradient-based MCMC? Bernoulli, 27(3):1942--1992, 2021. [ bib | DOI ]
We formulate gradient-based Markov chain Monte Carlo (MCMC) sampling as optimization on the space of probability measures, with Kullback–Leibler (KL) divergence as the objective functional. We show that an underdamped form of the Langevin algorithm performs accelerated gradient descent in this metric. To characterize the convergence of the algorithm, we construct a Lyapunov functional and exploit hypocoercivity of the underdamped Langevin algorithm. As an application, we show that accelerated rates can be obtained for a class of nonconvex functions with the Langevin algorithm.
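
For contrast with the overdamped update shown earlier, here is a naive Euler-type sketch of underdamped Langevin dynamics with friction gamma; the paper analyzes the continuous-time dynamics and careful discretizations, so this loop and its parameters are only an assumption-laden illustration of the momentum ("acceleration") structure.

    import numpy as np

    def underdamped_langevin(grad_U, x0, gamma, eta, n_steps, rng):
        # Naive Euler discretization (illustrative only) of:
        #   dx = v dt,   dv = -(gamma * v + grad U(x)) dt + sqrt(2 * gamma) dB_t
        x = np.array(x0, dtype=float)
        v = np.zeros_like(x)
        for _ in range(n_steps):
            v = v - eta * (gamma * v + grad_U(x)) + np.sqrt(2.0 * gamma * eta) * rng.standard_normal(x.shape)
            x = x + eta * v
        return x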

[CLB21] Niladri S. Chatterji, Philip M. Long, and Peter L. Bartlett. When does gradient descent with logistic loss find interpolating two-layer networks? Journal of Machine Learning Research, 22(159):1--48, 2021. [ bib | http ]
We study the training of finite-width two-layer smoothed ReLU networks for binary classification using the logistic loss. We show that gradient descent drives the training loss to zero if the initial loss is small enough. When the data satisfies certain cluster and separation conditions and the network is wide enough, we show that one step of gradient descent reduces the loss sufficiently that the first result applies.
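
A toy numpy sketch of this setting, under stated simplifications: softplus stands in for the paper's smoothed ReLU, the second layer is frozen at its random initialization, and the data, width, and step size are arbitrary placeholders.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    n, d, m = 200, 10, 500                                  # samples, input dim, hidden width (illustrative)
    X = rng.standard_normal((n, d))
    y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(n))     # toy labels in {-1, +1}
    W = rng.standard_normal((m, d)) / np.sqrt(d)            # first layer (trained)
    a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)        # second layer (frozen; a simplification for this sketch)

    eta = 0.1
    for _ in range(2000):
        H = X @ W.T                                         # pre-activations, shape (n, m)
        f = np.logaddexp(0.0, H) @ a                        # softplus activation, network output
        g = -y * sigmoid(-y * f) / n                        # derivative of mean logistic loss w.r.t. output
        grad_W = (g[:, None] * sigmoid(H) * a).T @ X        # chain rule; sigmoid is the derivative of softplus
        W -= eta * grad_W
    train_loss = np.mean(np.logaddexp(0.0, -y * (np.logaddexp(0.0, X @ W.T) @ a)))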

[MPB+20] Dhruv Malik, Ashwin Pananjady, Kush Bhatia, Koulik Khamaru, Peter L. Bartlett, and Martin J. Wainwright. Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. Journal of Machine Learning Research, 21(21):1--51, 2020. [ bib | .html ]
[BLLT20] Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063--30070, 2020. (arXiv:1906.11300). [ bib | DOI | arXiv | http ]
The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of the effective rank of the data covariance. It shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size. By studying examples of data covariance properties that this characterization shows are required for benign overfitting, we find an important role for finite-dimensional data: the accuracy of the minimum norm interpolating prediction rule approaches the best possible accuracy for a much narrower range of properties of the data distribution when the data lie in an infinite-dimensional space vs. when the data lie in a finite-dimensional space with dimension that grows faster than the sample size.
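
As a concrete illustration, in the overparameterized case the minimum norm interpolating rule is the pseudoinverse least-squares solution, and the two effective ranks of the covariance are (paraphrasing the definitions used in this line of work, so treat the exact form as an assumption) r_k = (sum of the eigenvalues beyond the k-th) / lambda_{k+1} and R_k = (that sum)^2 / (sum of their squares).

    import numpy as np

    def min_norm_interpolator(X, y):
        # With more features than samples, pinv returns the least-norm solution of X theta = y.
        return np.linalg.pinv(X) @ y

    def effective_ranks(eigvals, k):
        # eigvals sorted in decreasing order; paraphrased r_k and R_k from the characterization.
        tail = eigvals[k:]
        return tail.sum() / eigvals[k], tail.sum() ** 2 / (tail ** 2).sum()

    # Toy example: d = 200 features, n = 50 samples; the fitted rule interpolates the training data.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 200))
    y = X[:, 0] + rng.standard_normal(50)
    theta = min_norm_interpolator(X, y)
    assert np.allclose(X @ theta, y)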

[BHLM19] Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1--17, 2019. [ bib | .html ]
[BHL19] Peter L. Bartlett, David P. Helmbold, and Philip M. Long. Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks. Neural Computation, 31:477--502, 2019. [ bib ]
[HB17] Fares Hedayati and Peter L. Bartlett. Exchangeability characterizes optimality of sequential normalized maximum likelihood and Bayesian prediction. IEEE Transactions on Information Theory, 63(10):6767--6773, October 2017. [ bib | DOI | .pdf | .pdf ]
We study online learning under logarithmic loss with regular parametric models. In this setting, each strategy corresponds to a joint distribution on sequences. The minimax optimal strategy is the normalized maximum likelihood (NML) strategy. We show that the sequential normalized maximum likelihood (SNML) strategy predicts minimax optimally (i.e. as NML) if and only if the joint distribution on sequences defined by SNML is exchangeable. This property also characterizes the optimality of a Bayesian prediction strategy. In that case, the optimal prior distribution is Jeffreys prior for a broad class of parametric models for which the maximum likelihood estimator is asymptotically normal. The optimal prediction strategy, normalized maximum likelihood, depends on the number n of rounds of the game, in general. However, when a Bayesian strategy is optimal, normalized maximum likelihood becomes independent of n. Our proof uses this to exploit the asymptotics of normalized maximum likelihood. The asymptotic normality of the maximum likelihood estimator is responsible for the necessity of Jeffreys prior.
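
For reference, the (Shtarkov) normalized maximum likelihood strategy over a horizon of n rounds and its sequential variant can be written as follows; these are the standard definitions, restated here rather than quoted from the paper.

    p_{\mathrm{NML}}(x_1,\dots,x_n)
      = \frac{\sup_{\theta} p_\theta(x_1,\dots,x_n)}{\int \sup_{\theta} p_\theta(y_1,\dots,y_n)\, dy_1 \cdots dy_n},
    \qquad
    p_{\mathrm{SNML}}(x_t \mid x_1,\dots,x_{t-1}) \propto \sup_{\theta} p_\theta(x_1,\dots,x_t).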

[TB14] Ambuj Tewari and Peter L. Bartlett. Learning theory. In Paulo S.R. Diniz, Johan A.K. Suykens, Rama Chellappa, and Sergios Theodoridis, editors, Signal Processing Theory and Machine Learning, volume 1 of Academic Press Library in Signal Processing, pages 775--816. Elsevier, 2014. [ bib ]
[RRB14] J. Hyam Rubinstein, Benjamin Rubinstein, and Peter Bartlett. Bounding embeddings of VC classes into maximum classes. In A. Gammerman and V. Vovk, editors, Festschrift of Alexey Chervonenkis. Springer, 2014. [ bib | http ]
One of the earliest conjectures in computational learning theory---the Sample Compression Conjecture---asserts that concept classes (or set systems) admit compression schemes of size polynomial in their VC dimension. To date this statement is known to be true for maximum classes---those that meet Sauer's Lemma, which bounds class cardinality in terms of VC dimension, with equality. The most promising approach to positively resolving the conjecture is by embedding general VC classes into maximum classes without super-linear increase to their VC dimensions, as such embeddings extend the known compression schemes to all VC classes. We show that maximum classes can be characterized by a local-connectivity property of the graph obtained by viewing the class as a cubical complex. This geometric characterization of maximum VC classes is applied to prove a negative embedding result which demonstrates VC-d classes that cannot be embedded in any maximum class of VC dimension lower than 2d. On the other hand, we give a general recursive procedure for embedding VC-d classes into VC-(d+k) maximum classes for smallest k.
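
Sauer's Lemma, referred to above, bounds the number of sets in a class C of VC dimension d over an n-point domain; maximum classes are exactly those meeting the bound with equality:

    |C| \le \sum_{i=0}^{d} \binom{n}{i}.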

[BMN12] Peter L. Bartlett, Shahar Mendelson, and Joseph Neeman. ℓ1-regularized linear regression: Persistence and oracle inequalities. Probability Theory and Related Fields, 154(1--2):193--224, October 2012. [ bib | DOI | .pdf ]
We study the predictive performance of ℓ1-regularized linear regression in a model-free setting, including the case where the number of covariates is substantially larger than the sample size. We introduce a new analysis method that avoids the boundedness problems that typically arise in model-free empirical minimization. Our technique provides an answer to a conjecture of Greenshtein and Ritov regarding the “persistence” rate for linear regression and allows us to prove an oracle inequality for the error of the regularized minimizer. It also demonstrates that empirical risk minimization gives optimal rates (up to log factors) of convex aggregation of a set of estimators of a regression function.
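
Schematically, the persistence question of Greenshtein and Ritov concerns least squares constrained to an ℓ1-ball (equivalently, in Lagrangian form, the lasso); the display below is a standard formulation recalled from that literature, with the radius b_n an assumption-dependent sequence, and is not copied from the paper.

    \hat\beta_n \in \arg\min_{\|\beta\|_1 \le b_n} \; \frac{1}{n} \sum_{i=1}^{n} \bigl( Y_i - \beta^\top X_i \bigr)^2 .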

[RBHT12] Benjamin I. P. Rubinstein, Peter L. Bartlett, Ling Huang, and Nina Taft. Learning in a large function space: Privacy preserving mechanisms for SVM learning. Journal of Privacy and Confidentiality, 4(1):65--100, August 2012. [ bib | http ]

[NAB+12] Massieh Najafi, David M. Auslander, Peter L. Bartlett, Philip Haves, and Michael D. Sohn. Application of machine learning in the fault diagnostics of air handling units. Applied Energy, 96:347--358, August 2012. [ bib | DOI ]
[BRS+12] A. Barth, Benjamin I. P. Rubinstein, M. Sundararajan, J. C. Mitchell, Dawn Song, and Peter L. Bartlett. A learning-based approach to reactive security. IEEE Transactions on Dependable and Secure Computing, 9(4):482--493, July 2012. [ bib | http | .pdf ]
Despite the conventional wisdom that proactive security is superior to reactive security, we show that reactive security can be competitive with proactive security as long as the reactive defender learns from past attacks instead of myopically overreacting to the last attack. Our game-theoretic model follows common practice in the security literature by making worst-case assumptions about the attacker: we grant the attacker complete knowledge of the defender’s strategy and do not require the attacker to act rationally. In this model, we bound the competitive ratio between a reactive defense algorithm (which is inspired by online learning theory) and the best fixed proactive defense. Additionally, we show that, unlike proactive defenses, this reactive strategy is robust to a lack of information about the attacker’s incentives and knowledge.

[DBW12] John Duchi, Peter L. Bartlett, and Martin J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674--701, June 2012. [ bib | .pdf ]
We analyze convergence rates of stochastic optimization algorithms for nonsmooth convex optimization problems. By combining randomized smoothing techniques with accelerated gradient methods, we obtain convergence rates of stochastic optimization procedures, both in expectation and with high probability, that have optimal dependence on the variance of the gradient estimates. To the best of our knowledge, these are the first variance-based rates for nonsmooth optimization. We give several applications of our results to statistical estimation problems and provide experimental results that demonstrate the effectiveness of the proposed algorithms. We also describe how a combination of our algorithm with recent work on decentralized optimization yields a distributed stochastic optimization algorithm that is order-optimal.
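
A minimal sketch of the randomized-smoothing idea, under illustrative assumptions: replace a nonsmooth f by the Gaussian smoothing f_mu(x) = E[f(x + mu Z)], whose gradient equals the average of subgradients of f at perturbed points; the plain averaged-subgradient step below is not the accelerated method analyzed in the paper.

    import numpy as np

    def smoothed_gradient(subgrad_f, x, mu, n_samples, rng):
        # Monte Carlo estimate of grad f_mu(x): average subgradients of f at x + mu * Z, Z ~ N(0, I).
        Z = rng.standard_normal((n_samples, x.size))
        return np.mean([subgrad_f(x + mu * z) for z in Z], axis=0)

    # Illustrative use: roughly minimize f(x) = ||x - 1||_1 with a fixed step size.
    rng = np.random.default_rng(0)
    x = np.zeros(10)
    for _ in range(500):
        x -= 0.1 * smoothed_gradient(lambda u: np.sign(u - 1.0), x, mu=0.1, n_samples=4, rng=rng)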

[ABRW12] Alekh Agarwal, Peter Bartlett, Pradeep Ravikumar, and Martin Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235--3249, May 2012. [ bib | DOI | .pdf ]
Relative to the large literature on upper bounds on complexity of convex optimization, lesser attention has been paid to the fundamental hardness of these problems. Given the extensive use of convex optimization in machine learning and statistics, gaining an understanding of these complexity-theoretic issues is important. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for various function classes.

[AB11] Sylvain Arlot and Peter L. Bartlett. Margin-adaptive model selection in statistical learning. Bernoulli, 17(2):687--713, May 2011. [ bib | .pdf ]

[Bar10] Peter L. Bartlett. Learning to act in uncertain environments. Communications of the ACM, 53(5):98, May 2010. (Invited one-page comment). [ bib | DOI ]
[RBR10] Benjamin I. P. Rubinstein, Peter L. Bartlett, and J. Hyam Rubinstein. Corrigendum to 'Shifting: one-inclusion mistake bounds and sample compression' [J. Comput. System Sci. 75(1) (2009) 37--59]. Journal of Computer and System Sciences, 76(3--4):278--280, May 2010. [ bib | DOI ]
[BMP10] Peter L. Bartlett, Shahar Mendelson, and Petra Philips. On the optimality of sample-based estimates of the expectation of the empirical minimizer. ESAIM: Probability and Statistics, 14:315--337, January 2010. [ bib | .pdf ]
We study sample-based estimates of the expectation of the function produced by the empirical minimization algorithm. We investigate the extent to which one can estimate the rate of convergence of the empirical minimizer in a data dependent manner. We establish three main results. First, we provide an algorithm that upper bounds the expectation of the empirical minimizer in a completely data-dependent manner. This bound is based on a structural result in http://www.stat.berkeley.edu/~bartlett/papers/bm-em-03.pdf, which relates expectations to sample averages. Second, we show that these structural upper bounds can be loose. In particular, we demonstrate a class for which the expectation of the empirical minimizer decreases as O(1/n) for sample size n, although the upper bound based on structural properties is Ω(1). Third, we show that this looseness of the bound is inevitable: we present an example that shows that a sharp bound cannot be universally recovered from empirical data.

[RSBN09] David S. Rosenberg, Vikas Sindhwani, Peter L. Bartlett, and Partha Niyogi. Multiview point cloud kernels for semisupervised learning. IEEE Signal Processing Magazine, 26(5):145--150, September 2009. [ bib | DOI ]
[RBR09] Benjamin I. P. Rubinstein, Peter L. Bartlett, and J. Hyam Rubinstein. Shifting: one-inclusion mistake bounds and sample compression. Journal of Computer and System Sciences, 75(1):37--59, January 2009. (Was University of California, Berkeley, EECS Department Technical Report EECS-2007-86). [ bib | .pdf ]
[LBW08] Wee Sun Lee, Peter L. Bartlett, and Robert C. Williamson. Correction to 'The importance of convexity in learning with squared loss'. IEEE Transactions on Information Theory, 54(9):4395, September 2008. [ bib | .pdf ]
[BW08] Peter L. Bartlett and Marten H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9:1823--1840, August 2008. [ bib | .pdf ]
We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function f, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss---the f-risk---also minimizes the risk. We also study the rate at which the f-risk approaches its minimum value. We show that fast rates are possible when the conditional probability Pr(Y=1|X) is unlikely to be close to certain critical values.
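
A tiny sketch of the plug-in rule (Chow's rule) that this surrogate analysis targets, for rejection cost d < 1/2; eta_hat, an estimate of Pr(Y=1|X=x), is a placeholder input.

    def classify_with_reject(eta_hat, d):
        # Reject when both classes are too uncertain relative to the rejection cost d (assumed < 1/2).
        if min(eta_hat, 1.0 - eta_hat) > d:
            return "reject"
        return +1 if eta_hat >= 0.5 else -1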

[CGK+08] Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, and Peter L. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. Journal of Machine Learning Research, 9:1775--1822, August 2008. [ bib | .pdf ]
Log-linear and maximum-margin models are two commonly used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or max-margin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O(1/ε) EG updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log(1/ε)) updates are required. For both the max-margin and log-linear cases, our bounds suggest that the online algorithm requires a factor of n less computation to reach a desired accuracy, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to multi-class problems as well as a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods.
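
The core exponentiated gradient step on the probability simplex, written generically for a dual gradient g; the step size, dimension, and example values are placeholders, and the structured factorization the paper exploits is not shown.

    import numpy as np

    def eg_update(w, g, eta):
        # Multiplicative update followed by renormalization onto the simplex.
        w_new = w * np.exp(-eta * g)
        return w_new / w_new.sum()

    # Illustrative single step from the uniform distribution over four dual coordinates.
    w = np.full(4, 0.25)
    w = eg_update(w, g=np.array([0.3, -0.1, 0.0, 0.2]), eta=0.5)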

[Bar08] Peter L. Bartlett. Fast rates for estimation error and oracle inequalities for model selection. Econometric Theory, 24(2):545--552, April 2008. (Was Department of Statistics, U.C. Berkeley Technical Report number 729, 2007). [ bib | DOI | .pdf ]
We consider complexity penalization methods for model selection. These methods aim to choose a model to optimally trade off estimation and approximation errors by minimizing the sum of an empirical risk term and a complexity penalty. It is well known that if we use a bound on the maximal deviation between empirical and true risks as a complexity penalty, then the risk of our choice is no more than the approximation error plus twice the complexity penalty. There are many cases, however, where complexity penalties like this give loose upper bounds on the estimation error. In particular, if we choose a function from a suitably simple convex function class with a strictly convex loss function, then the estimation error (the difference between the risk of the empirical risk minimizer and the minimal risk in the class) approaches zero at a faster rate than the maximal deviation between empirical and true risks. In this note, we address the question of whether it is possible to design a complexity penalized model selection method for these situations. We show that, provided the sequence of models is ordered by inclusion, in these cases we can use tight upper bounds on estimation error as a complexity penalty. Surprisingly, this is the case even in situations when the difference between the empirical risk and true risk (and indeed the error of any estimate of the approximation error) decreases much more slowly than the complexity penalty. We give an oracle inequality showing that the resulting model selection method chooses a function with risk no more than the approximation error plus a constant times the complexity penalty.
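
Schematically, the selection rule in question chooses among nested models the one minimizing penalized empirical risk; this generic form is standard, with the paper's point being that tight estimation-error bounds may serve as the penalty.

    \hat m = \arg\min_{m} \Bigl\{ \widehat R_n(\hat f_m) + \mathrm{pen}_n(m) \Bigr\},
    \qquad
    \hat f_m = \arg\min_{f \in F_m} \widehat R_n(f).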

[TB07] Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007--1025, May 2007. (Invited paper). [ bib | .html ]
[BT07a] Peter L. Bartlett and Ambuj Tewari. Sparseness vs estimating conditional probabilities: Some asymptotic results. Journal of Machine Learning Research, 8:775--790, April 2007. [ bib | .html ]
[BT07b] Peter L. Bartlett and Mikhail Traskin. AdaBoost is consistent. Journal of Machine Learning Research, 8:2347--2368, 2007. [ bib | .pdf | .pdf ]
The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n^(1-a) iterations---for sample size n and 0<a<1---the sequence of risks of the classifiers it produces approaches the Bayes risk.
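
A hedged illustration of the stopping rule using scikit-learn's AdaBoostClassifier: run roughly n^(1-a) boosting rounds for sample size n. The choice a = 1/2, the synthetic data, and the default stump base learner are placeholders, not the paper's setting.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    rng = np.random.default_rng(0)
    n, a = 1000, 0.5
    X = rng.standard_normal((n, 5))
    y = (X[:, 0] + 0.5 * rng.standard_normal(n) > 0).astype(int)

    # Stop after about n^(1 - a) rounds, mirroring the consistency result's stopping rule.
    clf = AdaBoostClassifier(n_estimators=max(1, int(n ** (1 - a))))
    clf.fit(X, y)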

[BJM06b] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138--156, 2006. (Was Department of Statistics, U.C. Berkeley Technical Report number 638, 2003). [ bib | .ps.gz | .pdf ]
Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial upper bounds on excess risk under the weakest possible condition on the loss function: that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a simple variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled convex hulls of a finite-dimensional base class, with a variety of commonly used loss functions.
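
For concreteness, three of the convex surrogate losses covered by this kind of analysis, each a function of the margin z = y f(x); these are standard examples rather than a list taken from the paper.

    \phi_{\mathrm{hinge}}(z) = \max(0, 1 - z), \qquad
    \phi_{\mathrm{exp}}(z) = e^{-z}, \qquad
    \phi_{\mathrm{logistic}}(z) = \log\bigl(1 + e^{-z}\bigr).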

[BM06b] Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311--334, 2006. [ bib | .ps.gz | .pdf ]
We investigate the behavior of the empirical minimization algorithm using various methods. We first analyze it by comparing the empirical, random, structure and the original one on the class, either in an additive sense, via the uniform law of large numbers, or in a multiplicative sense, using isomorphic coordinate projections. We then show that a direct analysis of the empirical minimization algorithm yields a significantly better bound, and that the estimates we obtain are essentially sharp. The method of proof we use is based on Talagrand's concentration inequality for empirical processes.

[BJM06a] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Comment. Statistical Science, 21(3):341--346, 2006. [ bib ]
[BM06a] Peter L. Bartlett and Shahar Mendelson. Discussion of “2004 IMS Medallion Lecture: Local Rademacher complexities and oracle inequalities in risk minimization” by V. Koltchinskii. The Annals of Statistics, 34(6):2657--2663, 2006. [ bib ]
[BBM05] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497--1537, 2005. [ bib | .ps | .pdf ]
We propose new bounds on the error of learning algorithms in terms of a data-dependent notion of complexity. The estimates we establish give optimal rates and are based on a local and empirical version of Rademacher averages, in the sense that the Rademacher averages are computed from the data, on a subset of functions with small empirical error. We present some applications to prediction with bounded loss, and to regression with a convex loss function and a convex function class.
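
For reference, the empirical Rademacher average of a class F on a sample x_1, ..., x_n, and a localized version restricted to functions with small empirical second moment; the precise localization used in the paper may differ in constants and in whether empirical or population second moments appear, so read the second display as a schematic.

    \widehat{\mathcal R}_n(F) = \mathbb{E}_{\sigma} \Bigl[ \sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \Bigr],
    \qquad
    \widehat{\mathcal R}_n\bigl( \{ f \in F : P_n f^2 \le r \} \bigr),

where sigma_1, ..., sigma_n are independent uniform random signs.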

[LCB+04] G. Lanckriet, N. Cristianini, P. L. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semi-definite programming. Journal of Machine Learning Research, 5:27--72, 2004. [ bib | .ps.gz | .pdf ]
[GBB04] E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5:1471--1530, 2004. [ bib | .pdf ]
[BJM04] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Discussion of boosting papers. The Annals of Statistics, 32(1):85--91, 2004. [ bib | .ps.Z | .pdf ]
[BM03] Peter L. Bartlett and Wolfgang Maass. Vapnik-Chervonenkis dimension of neural nets. In Michael A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 1188--1192. MIT Press, 2003. Second Edition. [ bib | .ps.gz | .pdf ]
[Bar03] Peter L. Bartlett. An introduction to reinforcement learning theory: value function methods. In Shahar Mendelson and Alexander J. Smola, editors, Advanced Lectures on Machine Learning, volume 2600, pages 184--202. Springer, 2003. [ bib ]
[GBSTW02] Y. Guo, P. L. Bartlett, J. Shawe-Taylor, and R. C. Williamson. Covering numbers for support vector machines. IEEE Transactions on Information Theory, 48(1):239--250, 2002. [ bib ]
[BM02] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463--482, 2002. [ bib | .pdf ]
[BBL02] P. L. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. Machine Learning, 48:85--113, 2002. [ bib | .ps.gz ]
[BB02] P. L. Bartlett and J. Baxter. Estimation and approximation bounds for gradient-based reinforcement learning. Journal of Computer and System Sciences, 64(1):133--150, 2002. [ bib ]
[BBD02] P. L. Bartlett and S. Ben-David. Hardness results for neural network approximation problems. Theoretical Computer Science, 284(1):53--66, 2002. (special issue on Eurocolt'99). [ bib | http ]
[BFH02] P. L. Bartlett, P. Fischer, and K.-U. Höffgen. Exploiting random walks for learning. Information and Computation, 176(2):121--135, 2002. [ bib | http ]
[MBG02] L. Mason, P. L. Bartlett, and M. Golea. Generalization error of combined classifiers. Journal of Computer and System Sciences, 65(2):415--438, 2002. [ bib | http ]
[BB01] J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319--350, 2001. [ bib | .html ]
[BBW01] J. Baxter, P. L. Bartlett, and L. Weaver. Experiments with infinite-horizon, policy-gradient estimation. Journal of Artificial Intelligence Research, 15:351--381, 2001. [ bib | .html ]
[AB00] M. Anthony and P. L. Bartlett. Function learning from interpolation. Combinatorics, Probability, and Computing, 9:213--225, 2000. [ bib ]
[MBBF00] L. Mason, J. Baxter, P. L. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 221--246. MIT Press, 2000. [ bib ]
[SBSS00] A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans. Introduction to large margin classifiers. In Advances in Large Margin Classifiers, pages 1--29. MIT Press, 2000. [ bib ]
[BBDK00] P. L. Bartlett, S. Ben-David, and S. R. Kulkarni. Learning changing concepts by exploiting the structure of change. Machine Learning, 41(2):153--174, 2000. [ bib ]
[PPB00] S. Parameswaran, M. F. Parkinson, and P. L. Bartlett. Profiling in the ASP codesign environment. Journal of Systems Architecture, 46(14):1263--1274, 2000. [ bib ]
[SSWB00] B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207--1245, 2000. [ bib ]
[KBB00] L. C. Kammer, R. R. Bitmead, and P. L. Bartlett. Direct iterative tuning via spectral analysis. Automatica, 36(9):1301--1307, 2000. [ bib ]
[MBB00] L. Mason, P. L. Bartlett, and J. Baxter. Improved generalization through explicit optimization of margins. Machine Learning, 38(3):243--255, 2000. [ bib ]
[BL99] P. L. Bartlett and G. Lugosi. An inequality for uniform deviations of sample averages from their means. Statistics and Probability Letters, 44(1):55--62, 1999. [ bib ]
[Bar99] P. L. Bartlett. Efficient neural network learning. In V. D. Blondel, E. D. Sontag, M. Vidyasagar, and J. C. Willems, editors, Open Problems in Mathematical Systems Theory and Control, pages 35--38. Springer Verlag, 1999. [ bib ]
[BST99] P. L. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods -- Support Vector Learning, pages 43--54. MIT Press, 1999. [ bib ]
[SFBL98] R. E. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5):1651--1686, 1998. [ bib ]
[BMM98] P. L. Bartlett, V. Maiorov, and R. Meir. Almost linear VC dimension bounds for piecewise polynomial networks. Neural Computation, 10(8):2159--2173, 1998. [ bib ]
[LBW98] W. S. Lee, P. L. Bartlett, and R. C. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974--1980, 1998. [ bib ]
[STBWA98] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926--1940, 1998. [ bib ]
[BLL98] P. L. Bartlett, T. Linder, and G. Lugosi. The minimax distortion redundancy in empirical quantizer design. IEEE Transactions on Information Theory, 44(5):1802--1813, 1998. [ bib ]
[BK98] P. L. Bartlett and S. Kulkarni. The complexity of model classes, and smoothing noisy data. Systems and Control Letters, 34(3):133--140, 1998. [ bib ]
[BV98] P. L. Bartlett and M. Vidyasagar. Introduction to the special issue on learning theory. Systems and Control Letters, 34:113--114, 1998. [ bib ]
[KBB98] L. C. Kammer, R. R. Bitmead, and P. L. Bartlett. Optimal controller properties from closed-loop experiments. Automatica, 34(1):83--91, 1998. [ bib ]
[BL98] P. L. Bartlett and P. M. Long. Prediction, learning, uniform convergence, and scale-sensitive dimensions. Journal of Computer and System Sciences, 56(2):174--190, 1998. (special issue on COLT`95). [ bib ]
[Bar98] P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525--536, 1998. [ bib ]
[BKP97] P. L. Bartlett, S. R. Kulkarni, and S. E. Posner. Covering numbers for real-valued function classes. IEEE Transactions on Information Theory, 43(5):1721--1724, 1997. [ bib ]
[Bar97] P. L. Bartlett. Book review: `Neural networks for pattern recognition,' Christopher M. Bishop. Statistics in Medicine, 16(20):2385--2386, 1997. [ bib ]
[LBW97] W. S. Lee, P. L. Bartlett, and R. C. Williamson. Correction to 'Lower bounds on the VC-dimension of smoothly parametrized function classes'. Neural Computation, 9:765--769, 1997. [ bib ]
[LBW96] W. S. Lee, P. L. Bartlett, and R. C. Williamson. Efficient agnostic learning of neural networks with bounded fan-in. IEEE Transactions on Information Theory, 42(6):2118--2132, 1996. [ bib ]
[ABIST96] M. Anthony, P. L. Bartlett, Y. Ishai, and J. Shawe-Taylor. Valid generalisation from approximate interpolation. Combinatorics, Probability, and Computing, 5:191--214, 1996. [ bib ]
[BLW96] P. L. Bartlett, P. M. Long, and R. C. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52(3):434--452, 1996. (special issue on COLT`94). [ bib ]
[BW96] P. L. Bartlett and R. C. Williamson. The Vapnik-Chervonenkis dimension and pseudodimension of two-layer neural networks with discrete inputs. Neural Computation, 8:653--656, 1996. [ bib ]
[LBW95] W. S. Lee, P. L. Bartlett, and R. C. Williamson. Lower bounds on the VC-dimension of smoothly parametrized function classes. Neural Computation, 7:990--1002, 1995. (See also correction, Neural Computation, 9:765--769, 1997). [ bib ]
[Bar94] P. L. Bartlett. Computational learning theory. In A. Kent and J. G. Williams, editors, Encyclopedia of Computer Science and Technology, volume 31, pages 83--99. Marcel Dekker, 1994. [ bib ]
[Bar93] P. L. Bartlett. Vapnik-Chervonenkis dimension bounds for two- and three-layer networks. Neural Computation, 5(3):371--373, 1993. [ bib ]
[LBD92] D. R. Lovell, P. L. Bartlett, and T. Downs. Error and variance bounds on sigmoidal neurons with weight and input errors. Electronics Letters, 28(8):760--762, 1992. [ bib ]
[BD92] P. L. Bartlett and T. Downs. Using random weights to train multi-layer networks of hard-limiting units. IEEE Transactions on Neural Networks, 3(2):202--210, 1992. [ bib ]
