Peter Bartlett's 2006-2008 Papers

[BJM06b] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006. (Was Department of Statistics, U.C. Berkeley Technical Report number 638, 2003). [ bib | .ps.gz | .pdf ]
Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial upper bounds on excess risk under the weakest possible condition on the loss function: that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a simple variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled convex hulls of a finite-dimensional base class, with a variety of commonly used loss functions.

[BM06b] Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311-334, 2006. [ bib | .ps.gz | .pdf ]
We investigate the behavior of the empirical minimization algorithm using various methods. We first analyze it by comparing the empirical, random, structure and the original one on the class, either in an additive sense, via the uniform law of large numbers, or in a multiplicative sense, using isomorphic coordinate projections. We then show that a direct analysis of the empirical minimization algorithm yields a significantly better bound, and that the estimates we obtain are essentially sharp. The method of proof we use is based on Talagrand's concentration inequality for empirical processes.

[BMP07] Peter L. Bartlett, Shahar Mendelson, and Petra Philips. Optimal sample-based estimates of the expectation of the empirical minimizer. Technical report, U.C. Berkeley, 2007. [ bib | .ps.gz | .pdf ]
We study sample-based estimates of the expectation of the function produced by the empirical minimization algorithm. We investigate the extent to which one can estimate the rate of convergence of the empirical minimizer in a data dependent manner. We establish three main results. First, we provide an algorithm that upper bounds the expectation of the empirical minimizer in a completely data-dependent manner. This bound is based on a structural result in http://www.stat.berkeley.edu/~bartlett/papers/bm-em-03.pdf, which relates expectations to sample averages. Second, we show that these structural upper bounds can be loose. In particular, we demonstrate a class for which the expectation of the empirical minimizer decreases as O(1/n) for sample size n, although the upper bound based on structural properties is Ω(1). Third, we show that this looseness of the bound is inevitable: we present an example that shows that a sharp bound cannot be universally recovered from empirical data.

[BMP08] Peter L. Bartlett, Shahar Mendelson, and Petra Philips. Optimal sample-based estimates of the expectation of the empirical minimizer. ESAIM: Probability and Statistics, 2008. (Accepted). [ bib | .ps.gz | .pdf ]
We study sample-based estimates of the expectation of the function produced by the empirical minimization algorithm. We investigate the extent to which one can estimate the rate of convergence of the empirical minimizer in a data dependent manner. We establish three main results. First, we provide an algorithm that upper bounds the expectation of the empirical minimizer in a completely data-dependent manner. This bound is based on a structural result in http://www.stat.berkeley.edu/~bartlett/papers/bm-em-03.pdf, which relates expectations to sample averages. Second, we show that these structural upper bounds can be loose. In particular, we demonstrate a class for which the expectation of the empirical minimizer decreases as O(1/n) for sample size n, although the upper bound based on structural properties is Ω(1). Third, we show that this looseness of the bound is inevitable: we present an example that shows that a sharp bound cannot be universally recovered from empirical data.

[Bar07] Peter L. Bartlett. Fast rates for estimation error and oracle inequalities for model selection. Technical Report 729, Department of Statistics, U.C. Berkeley, 2007. [ bib | .pdf ]
We consider complexity penalization methods for model selection. These methods aim to choose a model to optimally trade off estimation and approximation errors by minimizing the sum of an empirical risk term and a complexity penalty. It is well known that if we use a bound on the maximal deviation between empirical and true risks as a complexity penalty, then the risk of our choice is no more than the approximation error plus twice the complexity penalty. There are many cases, however, where complexity penalties like this give loose upper bounds on the estimation error. In particular, if we choose a function from a suitably simple convex function class with a strictly convex loss function, then the estimation error (the difference between the risk of the empirical risk minimizer and the minimal risk in the class) approaches zero at a faster rate than the maximal deviation between empirical and true risks. In this note, we address the question of whether it is possible to design a complexity penalized model selection method for these situations. We show that, provided the sequence of models is ordered by inclusion, in these cases we can use tight upper bounds on estimation error as a complexity penalty. Surprisingly, this is the case even in situations when the difference between the empirical risk and true risk (and indeed the error of any estimate of the approximation error) decreases much more slowly than the complexity penalty. We give an oracle inequality showing that the resulting model selection method chooses a function with risk no more than the approximation error plus a constant times the complexity penalty.

[Bar08] Peter L. Bartlett. Fast rates for estimation error and oracle inequalities for model selection. Econometric Theory, 24(2), 2008. (To appear. Was Department of Statistics, U.C. Berkeley Technical Report number 729, 2007). [ bib | .pdf ]
We consider complexity penalization methods for model selection. These methods aim to choose a model to optimally trade off estimation and approximation errors by minimizing the sum of an empirical risk term and a complexity penalty. It is well known that if we use a bound on the maximal deviation between empirical and true risks as a complexity penalty, then the risk of our choice is no more than the approximation error plus twice the complexity penalty. There are many cases, however, where complexity penalties like this give loose upper bounds on the estimation error. In particular, if we choose a function from a suitably simple convex function class with a strictly convex loss function, then the estimation error (the difference between the risk of the empirical risk minimizer and the minimal risk in the class) approaches zero at a faster rate than the maximal deviation between empirical and true risks. In this note, we address the question of whether it is possible to design a complexity penalized model selection method for these situations. We show that, provided the sequence of models is ordered by inclusion, in these cases we can use tight upper bounds on estimation error as a complexity penalty. Surprisingly, this is the case even in situations when the difference between the empirical risk and true risk (and indeed the error of any estimate of the approximation error) decreases much more slowly than the complexity penalty. We give an oracle inequality showing that the resulting model selection method chooses a function with risk no more than the approximation error plus a constant times the complexity penalty.

[BW06] Peter L. Bartlett and Marten H. Wegkamp. Classification with a reject option using a hinge loss. Technical report, U.C. Berkeley, 2006. [ bib | .ps.gz | .pdf ]
We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function f, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss-the f-risk-also minimizes the risk. We also study the rate at which the f-risk approaches its minimum value. We show that fast rates are possible when the conditional probability Pr(Y=1|X) is unlikely to be close to certain critical values.

[BW08] Peter L. Bartlett and Marten H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 2008. (To appear.). [ bib | .ps.gz | .pdf ]
We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function f, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss-the f-risk-also minimizes the risk. We also study the rate at which the f-risk approaches its minimum value. We show that fast rates are possible when the conditional probability Pr(Y=1|X) is unlikely to be close to certain critical values.

[BT07a] Peter L. Bartlett and Ambuj Tewari. Sample complexity of policy search with known dynamics. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 97-104, Cambridge, MA, 2007. MIT Press. [ bib | .pdf ]
[RBR07b] Benjamin I. P. Rubinstein, Peter L. Bartlett, and J. Hyam Rubinstein. Shifting, one-inclusion mistake bounds and tight multiclass expected risk bounds. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 1193-1200, Cambridge, MA, 2007. MIT Press. To appear. [ bib | .pdf ]
[RB07b] David Rosenberg and Peter L. Bartlett. The Rademacher complexity of co-regularized kernel classes. In Marina Meila and Xiaotong Shen, editors, Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, 2007. To appear. [ bib | .pdf ]
In the multi-view approach to semi-supervised learning, we choose one predictor from each of multiple hypothesis classes, and we `co-regularize' our choices by penalizing disagreement among the predictors on the unlabeled data. In this paper we examine the co-regularization method used in the recently proposed co-regularized least squares (CoRLS) algorithm. In this method we have two hypothesis classes, each a reproducing kernel Hilbert space (RKHS), and we co-regularize by penalizing the average squared difference in predictions on the unlabeled data. We get our final predictor by taking the pointwise average of the predictors from each view. We call the set of predictors that can result from this procedure the co-regularized hypothesis class. The main result of this paper is a tight bound on the Rademacher complexity of the co-regularized hypothesis class in terms of the kernel matrices of each RKHS. We find that the co-regularization reduces the Rademacher complexity of the hypothesis class by an amount depending on how different the two views are, measured by a data dependent metric. We then use standard techniques to bound the gap between training error and test error for the CoRLS algorithm. Experimentally, we find that the amount of reduction in complexity introduced by co-regularization correlates with the amount of improvement that co-regularization gives in the CoRLS algorithm

[BT07c] Peter L. Bartlett and Mikhail Traskin. Adaboost is consistent. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 105-112, Cambridge, MA, 2007. MIT Press. [ bib | .pdf ]
The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n^(1-a) iterations-for sample size n and a>0-the sequence of risks of the classifiers it produces approaches the Bayes risk.

[BT06b] Peter L. Bartlett and Mikhail Traskin. Adaboost is consistent. Technical report, U. C. Berkeley, 2006. [ bib ]
[BJM06a] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Comment. Statistical Science, 21(3):341-346, 2006. [ bib ]
[BT06a] Peter L. Bartlett and Mikhail Traskin. Adaboost and other large margin classifiers: Convexity in pattern classification. In Proceedings of the 5th Workshop on Defence Applications of Signal Processing, 2006. To appear. [ bib ]
[BM06a] Peter L. Bartlett and Shahar Mendelson. Discussion of “2004 IMS Medallion Lecture: Local Rademacher complexities and oracle inequalities in risk minimizationby V. Koltchinskii. The Annals of Statistics, 34(6):2657-2663, 2006. [ bib ]
[TB07a] Ambuj Tewari and Peter L. Bartlett. Bounded parameter Markov decision processes with average reward criterion. In Proceedings of the Conference on Learning Theory, pages 263-277, 2007. [ bib ]
[ABR07a] Jacob Abernethy, Peter L. Bartlett, and Alexander Rakhlin. Multitask learning with expert advice. In Proceedings of the Conference on Learning Theory, pages 484-498, 2007. [ bib ]
We consider the problem of prediction with expert advice in the setting where a forecaster is presented with several online prediction tasks. Instead of competing against the best expert separately on each task, we assume the tasks are related, and thus we expect that a few experts will perform well on the entire set of tasks. That is, our forecaster would like, on each task, to compete against the best expert chosen from a small set of experts. While we describe the `ideal' algorithm and its performance bound, we show that the computation required for this algorithm is as hard as computation of a matrix permanent. We present an efficient algorithm based on mixing priors, and prove a bound that is nearly as good for the sequential task presentation case. We also consider a harder case where the task may change arbitrarily from round to round, and we develop an efficient randomized algorithm based on Markov chain Monte Carlo techniques.

[RAB07] Alexander Rakhlin, Jacob Abernethy, and Peter L. Bartlett. Online discovery of similarity mappings. In Proceedings of the 24th International Conference on Machine Learning (ICML-2007), 2007. To appear. [ bib ]
We consider the problem of choosing, sequentially, a map which assigns elements of a set A to a few elements of a set B. On each round, the algorithm suffers some cost associated with the chosen assignment, and the goal is to minimize the cumulative loss of these choices relative to the best map on the entire sequence. Even though the offline problem of finding the best map is provably hard, we show that there is an equivalent online approximation algorithm, Randomized Map Prediction (RMP), that is efficient and performs nearly as well. While drawing upon results from the `Online Prediction with Expert Advice' setting, we show how RMP can be utilized as an online approach to several standard batch problems. We apply RMP to online clustering as well as online feature selection and, surprisingly, RMP often outperforms the standard batch algorithms on these problems.

[BT07d] Peter L. Bartlett and Mikhail Traskin. Adaboost is consistent. Journal of Machine Learning Research, 8:2347-2368, 2007. [ bib | .pdf ]
The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n(1-a) iterations-for sample size n and 0<a<1-the sequence of risks of the classifiers it produces approaches the Bayes risk.

[RBR07a] Benjamin IP. Rubinstein, Peter L. Bartlett, and JHyam Rubinstein. Shifting: one-inclusion mistake bounds and sample compression. Technical report, EECS Department, University of California, Berkeley, 2007. [ bib | .pdf ]
We present new expected risk bounds for binary and multiclass prediction, and resolve several recent conjectures on sample compressibility due to Kuzmin and Warmuth. By exploiting the combinatorial structure of concept class F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy. The key step in their proof is a d=VC(F) bound on the graph density of a subgraph of the hypercube - one-inclusion graph. The first main result of this report is a density bound of n choose(n-1,<=d-1)/choose(n,<=d) < d, which positively resolves a conjecture of Kuzmin and Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved one-inclusion mistake bound. The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is an algebraic topological property of maximum classes of VC-dimension d as being d-contractible simplicial complexes, extending the well-known characterization that d=1 maximum classes are trees. We negatively resolve a minimum degree conjecture of Kuzmin and Warmuth - the second part to a conjectured proof of correctness for Peeling - that every class has one-inclusion minimum degree at most its VC-dimension. Our final main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This result improves on known PAC-based expected risk bounds by a factor of O(log n) and is shown to be optimal up to a O(log k) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout.

[RBR08] Benjamin IP. Rubinstein, Peter L. Bartlett, and JHyam Rubinstein. Shifting: one-inclusion mistake bounds and sample compression. Journal of Computer and System Sciences, 2008. (To appear. Was University of California, Berkeley, EECS Department Technical Report EECS-2007-86). [ bib | .pdf ]
[RB07a] David Rosenberg and Peter L. Bartlett. On bounds for Bayesian sequence prediction with non-Gaussian priors. Technical Report, 2007. [ bib ]
We present worst-case log-loss regret bounds for the Bayesian model averaging algorithm in the regression setting. Our work generalizes earlier work that gives bounds of a similar form that only hold for Gaussian priors. Our bounds hold for arbitrary priors, though the regret term now includes a term involving the modulus of continuity of the prior, as well as the value of the prior at the point of comparison.

[ABRT07] Jacob Abernethy, Peter L. Bartlett, Alexander Rakhlin, and Ambuj Tewari. Minimax lower bounds for online convex games. Technical Report, 2007. [ bib | .pdf ]
A number of learning problems can be cast as an Online Convex Game: on each round, a learner makes a prediction x from a convex set, the environment plays a loss function f, and the learner's long-term goal is to minimize regret. Algorithms have been proposed by Zinkevich, when f is assumed to be convex, and Hazan et al, when f is assumed to be strongly convex, that have provably low regret. We consider these two settings and analyze such games from a minimax perspective, proving lower bounds in each case. These results prove that the existing algorithms are essentially optimal.

[BHR08] Peter L. Bartlett, Elad Hazan, and Alexander Rakhlin. Adaptive online gradient descent. In Daphne Koller, Yoram Singer, and John Platt, editors, Advances in Neural Information Processing Systems 20, Cambridge, MA, 2008. MIT Press. To appear. [ bib | .pdf ]
We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al for strongly convex functions, achieving intermediate rates between sqrt(T) and logT. Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms.

[TB08] Ambuj Tewari and Peter L. Bartlett. Optimistic linear programming gives logarithmic regret for irreducible mdps. In Daphne Koller, Yoram Singer, and John Platt, editors, Advances in Neural Information Processing Systems 20, Cambridge, MA, 2008. MIT Press. To appear. [ bib | .pdf ]
We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates: a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P)logT of the reward obtained by the optimal policy, where C(P) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities and the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP.

[ABR07b] Jacob Duncan Abernethy, Peter L. Bartlett, and Alexander Rakhlin. Multitask learning with expert advice. Technical Report UCB/EECS-2007-20, EECS Department, University of California, Berkeley, 2007. [ bib | .html ]
[CGK+07] Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, and Peter L. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. Technical report, U.C. Berkeley, 2007. [ bib | .pdf ]
Log-linear and maximum-margin models are two commonly used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or max-margin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O(1ε) EG updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log( 1ε)) updates are required. For both the max-margin and log-linear cases, our bounds suggest that the online algorithm requires a factor of n less computation to reach a desired accuracy, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to multi-class problems as well as a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods.

[CGK+08] Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, and Peter L. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. Journal of Machine Learning Research, 2008. (Accepted). [ bib | .pdf ]
Log-linear and maximum-margin models are two commonly used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or max-margin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O(1ε) EG updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log( 1ε)) updates are required. For both the max-margin and log-linear cases, our bounds suggest that the online algorithm requires a factor of n less computation to reach a desired accuracy, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to multi-class problems as well as a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods.

[BDH+08] Peter L. Bartlett, Varsha Dani, Thomas Hayes, Sham Kakade, Alexander Rakhlin, and Ambuj Tewari. High-probability regret bounds for bandit online linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008), 2008. (To appear). [ bib | .pdf ]
[ABRT08] Jacob Abernethy, Peter L. Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008), 2008. (To appear). [ bib | .pdf ]
[TB07b] Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007-1025, May 2007. (Invited paper). [ bib | .html ]
[BT07b] Peter L. Bartlett and Ambuj Tewari. Sparseness vs estimating conditional probabilities: Some asymptotic results. Journal of Machine Learning Research, 8:775-790, April 2007. [ bib | .html ]
[LBW08] Wee Sun Lee, Peter L. Bartlett, and Robert C. Williamson. Correction to the importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 2008. (To appear). [ bib | .pdf ]
[AB08] Sylvain Arlot and Peter L. Bartlett. Margin-adaptive model selection in statistical learning. Submitted, 2008. [ bib | .pdf ]

This file was generated by bibtex2html 1.91.

Books

Journal papers, book chapters

All Publications