# Peter Bartlett's 2006-2021 Papers

 [BJM06b] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138--156, 2006. (Was Department of Statistics, U.C. Berkeley Technical Report number 638, 2003). [ bib | .ps.gz | .pdf ] Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial upper bounds on excess risk under the weakest possible condition on the loss function: that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a simple variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled convex hulls of a finite-dimensional base class, with a variety of commonly used loss functions. [BM06b] Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311--334, 2006. [ bib | .ps.gz | .pdf ] We investigate the behavior of the empirical minimization algorithm using various methods. We first analyze it by comparing the empirical, random, structure and the original one on the class, either in an additive sense, via the uniform law of large numbers, or in a multiplicative sense, using isomorphic coordinate projections. We then show that a direct analysis of the empirical minimization algorithm yields a significantly better bound, and that the estimates we obtain are essentially sharp. The method of proof we use is based on Talagrand's concentration inequality for empirical processes. [BW06] Peter L. Bartlett and Marten H. Wegkamp. Classification with a reject option using a hinge loss. Technical report, U.C. Berkeley, 2006. [ bib | .ps.gz | .pdf ] We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function f, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss---the f-risk---also minimizes the risk. We also study the rate at which the f-risk approaches its minimum value. We show that fast rates are possible when the conditional probability Pr(Y=1|X) is unlikely to be close to certain critical values. [BT06b] Peter L. Bartlett and Mikhail Traskin. Adaboost is consistent. Technical report, U. C. Berkeley, 2006. [ bib ] [BJM06a] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Comment. Statistical Science, 21(3):341--346, 2006. [ bib ] [BT06a] Peter L. Bartlett and Mikhail Traskin. Adaboost and other large margin classifiers: Convexity in pattern classification. In Proceedings of the 5th Workshop on Defence Applications of Signal Processing, 2006. [ bib ] [BM06a] Peter L. Bartlett and Shahar Mendelson. Discussion of “2004 IMS Medallion Lecture: Local Rademacher complexities and oracle inequalities in risk minimization” by V. Koltchinskii. The Annals of Statistics, 34(6):2657--2663, 2006. [ bib ] [BMP07] Peter L. Bartlett, Shahar Mendelson, and Petra Philips. Optimal sample-based estimates of the expectation of the empirical minimizer. Technical report, U.C. Berkeley, 2007. [ bib | .ps.gz | .pdf ] We study sample-based estimates of the expectation of the function produced by the empirical minimization algorithm. We investigate the extent to which one can estimate the rate of convergence of the empirical minimizer in a data dependent manner. We establish three main results. First, we provide an algorithm that upper bounds the expectation of the empirical minimizer in a completely data-dependent manner. This bound is based on a structural result in http://www.stat.berkeley.edu/~bartlett/papers/bm-em-03.pdf, which relates expectations to sample averages. Second, we show that these structural upper bounds can be loose. In particular, we demonstrate a class for which the expectation of the empirical minimizer decreases as O(1/n) for sample size n, although the upper bound based on structural properties is Ω(1). Third, we show that this looseness of the bound is inevitable: we present an example that shows that a sharp bound cannot be universally recovered from empirical data. [Bar07] Peter L. Bartlett. Fast rates for estimation error and oracle inequalities for model selection. Technical Report 729, Department of Statistics, U.C. Berkeley, 2007. [ bib | .pdf ] We consider complexity penalization methods for model selection. These methods aim to choose a model to optimally trade off estimation and approximation errors by minimizing the sum of an empirical risk term and a complexity penalty. It is well known that if we use a bound on the maximal deviation between empirical and true risks as a complexity penalty, then the risk of our choice is no more than the approximation error plus twice the complexity penalty. There are many cases, however, where complexity penalties like this give loose upper bounds on the estimation error. In particular, if we choose a function from a suitably simple convex function class with a strictly convex loss function, then the estimation error (the difference between the risk of the empirical risk minimizer and the minimal risk in the class) approaches zero at a faster rate than the maximal deviation between empirical and true risks. In this note, we address the question of whether it is possible to design a complexity penalized model selection method for these situations. We show that, provided the sequence of models is ordered by inclusion, in these cases we can use tight upper bounds on estimation error as a complexity penalty. Surprisingly, this is the case even in situations when the difference between the empirical risk and true risk (and indeed the error of any estimate of the approximation error) decreases much more slowly than the complexity penalty. We give an oracle inequality showing that the resulting model selection method chooses a function with risk no more than the approximation error plus a constant times the complexity penalty. [BT07a] Peter L. Bartlett and Ambuj Tewari. Sample complexity of policy search with known dynamics. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 97--104, Cambridge, MA, 2007. MIT Press. [ bib | .pdf ] [RBR07b] Benjamin I. P. Rubinstein, Peter L. Bartlett, and J. Hyam Rubinstein. Shifting, one-inclusion mistake bounds and tight multiclass expected risk bounds. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 1193--1200, Cambridge, MA, 2007. MIT Press. [ bib | .pdf ] [RB07b] David Rosenberg and Peter L. Bartlett. The Rademacher complexity of co-regularized kernel classes. In Marina Meila and Xiaotong Shen, editors, Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, volume 2, pages 396--403, 2007. [ bib | .pdf ] In the multi-view approach to semi-supervised learning, we choose one predictor from each of multiple hypothesis classes, and we co-regularize' our choices by penalizing disagreement among the predictors on the unlabeled data. In this paper we examine the co-regularization method used in the recently proposed co-regularized least squares (CoRLS) algorithm. In this method we have two hypothesis classes, each a reproducing kernel Hilbert space (RKHS), and we co-regularize by penalizing the average squared difference in predictions on the unlabeled data. We get our final predictor by taking the pointwise average of the predictors from each view. We call the set of predictors that can result from this procedure the co-regularized hypothesis class. The main result of this paper is a tight bound on the Rademacher complexity of the co-regularized hypothesis class in terms of the kernel matrices of each RKHS. We find that the co-regularization reduces the Rademacher complexity of the hypothesis class by an amount depending on how different the two views are, measured by a data dependent metric. We then use standard techniques to bound the gap between training error and test error for the CoRLS algorithm. Experimentally, we find that the amount of reduction in complexity introduced by co-regularization correlates with the amount of improvement that co-regularization gives in the CoRLS algorithm [BT07c] Peter L. Bartlett and Mikhail Traskin. Adaboost is consistent. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 105--112, Cambridge, MA, 2007. MIT Press. [ bib | .pdf ] The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n^(1-a) iterations---for sample size n and a>0---the sequence of risks of the classifiers it produces approaches the Bayes risk. [TB07a] Ambuj Tewari and Peter L. Bartlett. Bounded parameter Markov decision processes with average reward criterion. In Proceedings of the Conference on Learning Theory, pages 263--277, 2007. [ bib ] [ABR07a] Jacob Abernethy, Peter L. Bartlett, and Alexander Rakhlin. Multitask learning with expert advice. In Proceedings of the Conference on Learning Theory, pages 484--498, 2007. [ bib ] We consider the problem of prediction with expert advice in the setting where a forecaster is presented with several online prediction tasks. Instead of competing against the best expert separately on each task, we assume the tasks are related, and thus we expect that a few experts will perform well on the entire set of tasks. That is, our forecaster would like, on each task, to compete against the best expert chosen from a small set of experts. While we describe the ideal' algorithm and its performance bound, we show that the computation required for this algorithm is as hard as computation of a matrix permanent. We present an efficient algorithm based on mixing priors, and prove a bound that is nearly as good for the sequential task presentation case. We also consider a harder case where the task may change arbitrarily from round to round, and we develop an efficient randomized algorithm based on Markov chain Monte Carlo techniques. [RAB07] Alexander Rakhlin, Jacob Abernethy, and Peter L. Bartlett. Online discovery of similarity mappings. In Proceedings of the 24th International Conference on Machine Learning (ICML-2007), pages 767--774, 2007. [ bib ] We consider the problem of choosing, sequentially, a map which assigns elements of a set A to a few elements of a set B. On each round, the algorithm suffers some cost associated with the chosen assignment, and the goal is to minimize the cumulative loss of these choices relative to the best map on the entire sequence. Even though the offline problem of finding the best map is provably hard, we show that there is an equivalent online approximation algorithm, Randomized Map Prediction (RMP), that is efficient and performs nearly as well. While drawing upon results from the `Online Prediction with Expert Advice' setting, we show how RMP can be utilized as an online approach to several standard batch problems. We apply RMP to online clustering as well as online feature selection and, surprisingly, RMP often outperforms the standard batch algorithms on these problems. [BT07d] Peter L. Bartlett and Mikhail Traskin. Adaboost is consistent. Journal of Machine Learning Research, 8:2347--2368, 2007. [ bib | .pdf | .pdf ] The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n(1-a) iterations---for sample size n and 0

This file was generated by bibtex2html 1.98.