Peter Bartlett's Journal Papers and Book Chapters

[BMN12] Peter L. Bartlett, Shahar Mendelson, and Joseph Neeman. l1-regularized linear regression: Persistence and oracle inequalities. Probability Theory and Related Fields, 2012. To appear. [ bib ]
[RBHT12] Benjamin I. P. Rubinstein, Peter L. Bartlett, Ling Huang, and Nina Taft. Learning in a large function space: Privacy preserving mechanisms for svm learning. Journal of Privacy and Confidentiality, 2012. To appear. [ bib ]
[ABRW12] Alekh Agarwal, Peter Bartlett, Pradeep Ravikumar, and Martin Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 2012. To appear. [ bib ]
[BRS+12] A. Barth, Benjamin I. P. Rubinstein, M. Sundararajan, J. C. Mitchell, Dawn Song, and Peter L. Bartlett. A learning-based approach to reactive security. IEEE Transactions on Dependable and Secure Computing, 2012. To appear. [ bib ]
[AB11] Sylvain Arlot and Peter L. Bartlett. Margin-adaptive model selection in statistical learning. Bernoulli, 2011. To appear (accepted May 21, 2010). [ bib | .pdf ]
[BMP10] Peter L. Bartlett, Shahar Mendelson, and Petra Philips. On the optimality of sample-based estimates of the expectation of the empirical minimizer. ESAIM: Probability and Statistics, 14:315-337, 2010. [ bib | .pdf ]
We study sample-based estimates of the expectation of the function produced by the empirical minimization algorithm. We investigate the extent to which one can estimate the rate of convergence of the empirical minimizer in a data dependent manner. We establish three main results. First, we provide an algorithm that upper bounds the expectation of the empirical minimizer in a completely data-dependent manner. This bound is based on a structural result in http://www.stat.berkeley.edu/~bartlett/papers/bm-em-03.pdf, which relates expectations to sample averages. Second, we show that these structural upper bounds can be loose. In particular, we demonstrate a class for which the expectation of the empirical minimizer decreases as O(1/n) for sample size n, although the upper bound based on structural properties is Ω(1). Third, we show that this looseness of the bound is inevitable: we present an example that shows that a sharp bound cannot be universally recovered from empirical data.

[Bar10] Peter L. Bartlett. Learning to act in uncertain environments. Communications of the ACM, 53(5):98, 2010. [ bib ]
[RBR10] Benjamin I. P. Rubinstein, Peter L. Bartlett, and J. Hyam Rubinstein. Corrigendum to `shifting: One-inclusion mistake bounds and sample compression' [j. comput. system sci 75 (1) (2009) 37-59]. Journal of Computer and System Sciences, 76(3-4):278-280, 2010. [ bib ]
[RSBN09] David S. Rosenberg, Vikas Sindhwani, Peter L. Bartlett, and Partha Niyogi. Multiview point cloud kernels for semisupervised learning. IEEE Signal Processing Magazine, 26(5):145-150, September 2009. [ bib ]
[RBR09] Benjamin I. P. Rubinstein, Peter L. Bartlett, and J. Hyam Rubinstein. Shifting: one-inclusion mistake bounds and sample compression. Journal of Computer and System Sciences, 75(1):37-59, January 2009. (Was University of California, Berkeley, EECS Department Technical Report EECS-2007-86). [ bib | .pdf ]
[Bar08] Peter L. Bartlett. Fast rates for estimation error and oracle inequalities for model selection. Econometric Theory, 24(2):545-552, 2008. (Was Department of Statistics, U.C. Berkeley Technical Report number 729, 2007). [ bib | .pdf ]
We consider complexity penalization methods for model selection. These methods aim to choose a model to optimally trade off estimation and approximation errors by minimizing the sum of an empirical risk term and a complexity penalty. It is well known that if we use a bound on the maximal deviation between empirical and true risks as a complexity penalty, then the risk of our choice is no more than the approximation error plus twice the complexity penalty. There are many cases, however, where complexity penalties like this give loose upper bounds on the estimation error. In particular, if we choose a function from a suitably simple convex function class with a strictly convex loss function, then the estimation error (the difference between the risk of the empirical risk minimizer and the minimal risk in the class) approaches zero at a faster rate than the maximal deviation between empirical and true risks. In this note, we address the question of whether it is possible to design a complexity penalized model selection method for these situations. We show that, provided the sequence of models is ordered by inclusion, in these cases we can use tight upper bounds on estimation error as a complexity penalty. Surprisingly, this is the case even in situations when the difference between the empirical risk and true risk (and indeed the error of any estimate of the approximation error) decreases much more slowly than the complexity penalty. We give an oracle inequality showing that the resulting model selection method chooses a function with risk no more than the approximation error plus a constant times the complexity penalty.

[BW08] Peter L. Bartlett and Marten H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9:1823-1840, 2008. [ bib | .pdf ]
We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function f, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss-the f-risk-also minimizes the risk. We also study the rate at which the f-risk approaches its minimum value. We show that fast rates are possible when the conditional probability Pr(Y=1|X) is unlikely to be close to certain critical values.

[CGK+08] Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, and Peter L. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. Journal of Machine Learning Research, 9:1775-1822, 2008. [ bib | .pdf ]
Log-linear and maximum-margin models are two commonly used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or max-margin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O(1ε) EG updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log( 1ε)) updates are required. For both the max-margin and log-linear cases, our bounds suggest that the online algorithm requires a factor of n less computation to reach a desired accuracy, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to multi-class problems as well as a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods.

[LBW08] Wee Sun Lee, Peter L. Bartlett, and Robert C. Williamson. Correction to the importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 54(9):4395, 2008. [ bib | .pdf ]
[TB07] Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007-1025, May 2007. (Invited paper). [ bib | .html ]
[BT07a] Peter L. Bartlett and Ambuj Tewari. Sparseness vs estimating conditional probabilities: Some asymptotic results. Journal of Machine Learning Research, 8:775-790, April 2007. [ bib | .html ]
[BT07b] Peter L. Bartlett and Mikhail Traskin. Adaboost is consistent. Journal of Machine Learning Research, 8:2347-2368, 2007. [ bib | .pdf ]
The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n^(1-a) iterations-for sample size n and 0<a<1-the sequence of risks of the classifiers it produces approaches the Bayes risk.

[BJM06b] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006. (Was Department of Statistics, U.C. Berkeley Technical Report number 638, 2003). [ bib | .ps.gz | .pdf ]
Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial upper bounds on excess risk under the weakest possible condition on the loss function: that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a simple variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled convex hulls of a finite-dimensional base class, with a variety of commonly used loss functions.

[BM06b] Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311-334, 2006. [ bib | .ps.gz | .pdf ]
We investigate the behavior of the empirical minimization algorithm using various methods. We first analyze it by comparing the empirical, random, structure and the original one on the class, either in an additive sense, via the uniform law of large numbers, or in a multiplicative sense, using isomorphic coordinate projections. We then show that a direct analysis of the empirical minimization algorithm yields a significantly better bound, and that the estimates we obtain are essentially sharp. The method of proof we use is based on Talagrand's concentration inequality for empirical processes.

[BJM06a] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Comment. Statistical Science, 21(3):341-346, 2006. [ bib ]
[BM06a] Peter L. Bartlett and Shahar Mendelson. Discussion of “2004 IMS Medallion Lecture: Local Rademacher complexities and oracle inequalities in risk minimization” by V. Koltchinskii. The Annals of Statistics, 34(6):2657-2663, 2006. [ bib ]
[BBM05] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497-1537, 2005. [ bib | .ps | .pdf ]
We propose new bounds on the error of learning algorithms in terms of a data-dependent notion of complexity. The estimates we establish give optimal rates and are based on a local and empirical version of Rademacher averages, in the sense that the Rademacher averages are computed from the data, on a subset of functions with small empirical error. We present some applications to prediction with bounded loss, and to regression with a convex loss function and a convex function class.

[LCB+04] G. Lanckriet, N. Cristianini, P. L. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semi-definite programming. Journal of Machine Learning Research, 5:27-72, 2004. [ bib | .ps.gz | .pdf ]
[GBB04] E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5:1471-1530, 2004. [ bib | .pdf ]
[BJM04] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Discussion of boosting papers. The Annals of Statistics, 32(1):85-91, 2004. [ bib | .ps.Z | .pdf ]
[BM03] Peter L. Bartlett and Wolfgang Maass. Vapnik-Chervonenkis dimension of neural nets. In Michael A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 1188-1192. MIT Press, 2003. Second Edition. [ bib | .ps.gz | .pdf ]
[Bar03] Peter L. Bartlett. An introduction to reinforcement learning theory: value function methods. In Shahar Mendelson and Alexander J. Smola, editors, Advanced Lectures on Machine Learning, volume 2600, pages 184-202. Springer, 2003. [ bib ]
[GBSTW02] Y. Guo, P. L. Bartlett, J. Shawe-Taylor, and R. C. Williamson. Covering numbers for support vector machines. IEEE Transactions on Information Theory, 48(1):239-250, 2002. [ bib ]
[BM02] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002. [ bib | .pdf ]
[BBL02] P. L. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. Machine Learning, 48:85-113, 2002. [ bib | .ps.gz ]
[BB02] P. L. Bartlett and J. Baxter. Estimation and approximation bounds for gradient-based reinforcement learning. Journal of Computer and System Sciences, 64(1):133-150, 2002. [ bib ]
[BBD02] P. L. Bartlett and S. Ben-David. Hardness results for neural network approximation problems. Theoretical Computer Science, 284(1):53-66, 2002. (special issue on Eurocolt'99). [ bib | http ]
[BFH02] P. L. Bartlett, P. Fischer, and K.-U. Höffgen. Exploiting random walks for learning. Information and Computation, 176(2):121-135, 2002. [ bib | http ]
[MBG02] L. Mason, P. L. Bartlett, and M. Golea. Generalization error of combined classifiers. Journal of Computer and System Sciences, 65(2):415-438, 2002. [ bib | http ]
[BB01] J. Baxter and P. L. Bartlett. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15:319-350, 2001. [ bib | .html ]
[BBW01] J. Baxter, P. L. Bartlett, and L. Weaver. Experiments with infinite-horizon, policy-gradient estimation. Journal of Artificial Intelligence Research, 15:351-381, 2001. [ bib | .html ]
[AB00] M. Anthony and P. L. Bartlett. Function learning from interpolation. Combinatorics, Probability, and Computing, 9:213-225, 2000. [ bib ]
[MBBF00] L. Mason, J. Baxter, P. L. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 221-246. MIT Press, 2000. [ bib ]
[SBSS00] A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans. Introduction to large margin classifiers. In Advances in Large Margin Classifiers, pages 1-29. MIT Press, 2000. [ bib ]
[BBDK00] P. L. Bartlett, S. Ben-David, and S. R. Kulkarni. Learning changing concepts by exploiting the structure of change. Machine Learning, 41(2):153-174, 2000. [ bib ]
[PPB00] S. Parameswaran, M. F. Parkinson, and P. L. Bartlett. Profiling in the ASP codesign environment. Journal of Systems Architecture, 46(14):1263-1274, 2000. [ bib ]
[SSWB00] B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207-1245, 2000. [ bib ]
[KBB00] L. C. Kammer, R. R. Bitmead, and P. L. Bartlett. Direct iterative tuning via spectral analysis. Automatica, 36(9):1301-1307, 2000. [ bib ]
[MBB00] L. Mason, P. L. Bartlett, and J. Baxter. Improved generalization through explicit optimization of margins. Machine Learning, 38(3):243-255, 2000. [ bib ]
[BL99] P. L. Bartlett and G. Lugosi. An inequality for uniform deviations of sample averages from their means. Statistics and Probability Letters, 44(1):55-62, 1999. [ bib ]
[Bar99] P. L. Bartlett. Efficient neural network learning. In V. D. Blondel, E. D. Sontag, M. Vidyasagar, and J. C. Willems, editors, Open Problems in Mathematical Systems Theory and Control, pages 35-38. Springer Verlag, 1999. [ bib ]
[BST99] P. L. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 43-54. MIT Press, 1999. [ bib ]
[SFBL98] R. E. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5):1651-1686, 1998. [ bib ]
[BMM98] P. L. Bartlett, V. Maiorov, and R. Meir. Almost linear VC dimension bounds for piecewise polynomial networks. Neural Computation, 10(8):2159-2173, 1998. [ bib ]
[LBW98] W. S. Lee, P. L. Bartlett, and R. C. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974-1980, 1998. [ bib ]
[STBWA98] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926-1940, 1998. [ bib ]
[BLL98] P. L. Bartlett, T. Linder, and G. Lugosi. The minimax distortion redundancy in empirical quantizer design. IEEE Transactions on Information Theory, 44(5):1802-1813, 1998. [ bib ]
[BK98] P. L. Bartlett and S. Kulkarni. The complexity of model classes, and smoothing noisy data. Systems and Control Letters, 34(3):133-140, 1998. [ bib ]
[BV98] P. L. Bartlett and M. Vidyasagar. Introduction to the special issue on learning theory. Systems and Control Letters, 34:113-114, 1998. [ bib ]
[KBB98] L. C. Kammer, R. R. Bitmead, and P. L. Bartlett. Optimal controller properties from closed-loop experiments. Automatica, 34(1):83-91, 1998. [ bib ]
[BL98] P. L. Bartlett and P. M. Long. Prediction, learning, uniform convergence, and scale-sensitive dimensions. Journal of Computer and System Sciences, 56(2):174-190, 1998. (special issue on COLT`95). [ bib ]
[Bar98] P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525-536, 1998. [ bib ]
[BKP97] P. L. Bartlett, S. R. Kulkarni, and S. E. Posner. Covering numbers for real-valued function classes. IEEE Transactions on Information Theory, 43(5):1721-1724, 1997. [ bib ]
[Bar97] P. L. Bartlett. Book review: `Neural networks for pattern recognition,' Christopher M. Bishop. Statistics in Medicine, 16(20):2385-2386, 1997. [ bib ]
[LBW97] W. S. Lee, P. L. Bartlett, and R. C. Williamson. Correction to `lower bounds on the VC-dimension of smoothly parametrized function classes'. Neural Computation, 9:765-769, 1997. [ bib ]
[LBW96] W. S. Lee, P. L. Bartlett, and R. C. Williamson. Efficient agnostic learning of neural networks with bounded fan-in. IEEE Transactions on Information Theory, 42(6):2118-2132, 1996. [ bib ]
[ABIST96] M. Anthony, P. L. Bartlett, Y. Ishai, and J. Shawe-Taylor. Valid generalisation from approximate interpolation. Combinatorics, Probability, and Computing, 5:191-214, 1996. [ bib ]
[BLW96] P. L. Bartlett, P. M. Long, and R. C. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52(3):434-452, 1996. (special issue on COLT`94). [ bib ]
[BW96] P. L. Bartlett and R. C. Williamson. The Vapnik-Chervonenkis dimension and pseudodimension of two-layer neural networks with discrete inputs. Neural Computation, 8:653-656, 1996. [ bib ]
[LBW95] W. S. Lee, P. L. Bartlett, and R. C. Williamson. Lower bounds on the VC-dimension of smoothly parametrized function classes. Neural Computation, 7:990-1002, 1995. (See also correction, Neural Computation, 9: 765-769, 1997). [ bib ]
[Bar94] P. L. Bartlett. Computational learning theory. In A. Kent and J. G. Williams, editors, Encyclopedia of Computer Science and Technology, volume 31, pages 83-99. Marcel Dekker, 1994. [ bib ]
[Bar93] P. L. Bartlett. Vapnik-Chervonenkis dimension bounds for two- and three-layer networks. Neural Computation, 5(3):371-373, 1993. [ bib ]
[LBD92] D. R. Lovell, P. L. Bartlett, and T. Downs. Error and variance bounds on sigmoidal neurons with weight and input errors. Electronics Letters, 28(8):760-762, 1992. [ bib ]
[BD92] P. L. Bartlett and T. Downs. Using random weights to train multi-layer networks of hard-limiting units. IEEE Transactions on Neural Networks, 3(2):202-210, 1992. [ bib ]

This file was generated by bibtex2html 1.95.