CS 294 / Stat 260, Fall 2014:
Learning in Sequential Decision
Problems
Readings
This page contains pointers to a collection of resources
on the topics covered in lectures.
Tutorials/Courses/Survey Papers/Texts
Stochastic Bandits
Classics
Regret bounds and UCB strategies

1995.
Rajeev Agrawal.
Sample mean based index policies with O(log n) regret
for the multi-armed bandit problem.
Advances in Applied
Probability, 27: 1054-1078.
2002.
Peter Auer, Nicolò Cesa-Bianchi
and Paul Fischer.
Finite-time
analysis of the multiarmed bandit problem.
Machine Learning,
47(2-3): 235-256.
Studied the stochastic bandit problem. Agrawal applied a large deviations
analysis; Auer et al. gave finite-sample results.
Both rest on the principle of optimism in the face of uncertainty.
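As a concrete illustration of an optimistic index strategy, here is a minimal sketch of UCB1 in the style of Auer et al.; the `pull` interface and the parameters are illustrative assumptions, not part of any of the papers above:

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """UCB1 (Auer, Cesa-Bianchi and Fischer, 2002), sketched for rewards
    in [0, 1]; `pull(arm)` is an assumed reward oracle."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    total = 0.0
    for arm in range(n_arms):          # play each arm once to initialize
        r = pull(arm)
        counts[arm], means[arm] = 1, r
        total += r
    for t in range(n_arms + 1, horizon + 1):
        # Optimism: play the arm with the largest upper confidence bound.
        arm = max(range(n_arms),
                  key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
        total += r
    return total, counts

# Illustrative usage: two Bernoulli arms with means 0.3 and 0.6.
random.seed(0)
probs = [0.3, 0.6]
total, counts = ucb1(lambda i: 1.0 if random.random() < probs[i] else 0.0, 2, 2000)
```

Over 2000 rounds the better arm accumulates most of the pulls, while the confidence term forces a logarithmic amount of exploration of the other arm.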

2013.
Olivier Cappé, Aurélien Garivier, Odalric-Ambrym
Maillard, Rémi
Munos, Gilles Stoltz.
Kullback-Leibler Upper Confidence Bounds for Optimal Sequential
Allocation.
Annals of Statistics, 41(3): 1516-1541.
Studied KL-UCB, a generalization of Auer et al.'s UCB strategy.
Gave finite-time regret bounds with optimal leading
terms for bounded and exponential family reward distributions.
These results also imply the best known regret bounds for
UCB.
(Beware of some typos throughout Section 6.1: moment
generating functions are used in place of cumulant generating
functions, so sprinkle in some "log"s and replace a few "moment"s with
"cumulant"s in that section.)

2012.
Emilie Kaufmann, Olivier Cappé, Aurélien Garivier.
Bayes-UCB: On Bayesian Upper Confidence Bounds for Bandit Problems.
JMLR W&CP 22:592-600.
Introduces UCB strategies based on quantiles of posterior
distributions under Bayesian updates, and demonstrates their regret performance.
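A minimal sketch of the quantile index behind such a strategy, for Bernoulli rewards with a Beta posterior; the Monte Carlo quantile and the 1 - 1/t level are simplifying assumptions (in practice an exact inverse CDF, e.g. scipy.stats.beta.ppf, would be used):

```python
import random

def beta_quantile(a, b, q, n_samples=4000, rng=random):
    """Monte Carlo q-quantile of a Beta(a, b) distribution (a rough stand-in
    for an exact inverse CDF)."""
    samples = sorted(rng.betavariate(a, b) for _ in range(n_samples))
    return samples[min(int(q * n_samples), n_samples - 1)]

def bayes_ucb_arm(successes, failures, t):
    """Play the arm maximizing the (1 - 1/t)-quantile of its Beta posterior,
    here under a uniform Beta(1, 1) prior."""
    q = 1.0 - 1.0 / t
    indices = [beta_quantile(s + 1, f + 1, q)
               for s, f in zip(successes, failures)]
    return max(range(len(indices)), key=lambda i: indices[i])

# Illustrative usage: the arm with the stronger record gets the larger index.
random.seed(0)
arm = bayes_ucb_arm([90, 10], [10, 90], t=100)
```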
Gittins Index

1974.
J. C. Gittins and D. M. Jones.
A dynamic allocation index for the
sequential design of experiments.
In J. Gani (Ed.), Progress in Statistics,
North-Holland, Amsterdam, pp. 241-266.
1979.
J. C. Gittins.
Bandit processes and dynamic allocation indices.
J. R. Statist. Soc. B 41(2):148-177.
Studied index policies. To maximize expected discounted reward in a Bayesian
formulation (Bernoulli rewards with a conjugate, i.e., Beta, prior):
convert to a stopping problem. The optimal strategy compares
indices, one computed for each arm.

1992.
Richard Weber.
On the Gittins Index for Multiarmed Bandits.
Ann. Appl. Probab. 2(4):1024-1033.
Gives the elegant, simple proof of the optimality of Gittins indices
that we looked at in class.

1980.
Peter Whittle.
Multi-Armed Bandits and the Gittins Index.
Journal of the Royal Statistical Society, Series B 42(2): 143-149.
1994.
John Tsitsiklis.
A short proof of the Gittins index theorem.
Ann. Appl. Probab. 4(1): 194-199.
Two alternative proofs of the Gittins index theorem:
the first based on the retirement option
interpretation, the second on a semi-Markov process formulation.

2014.
Jhelum Chakravorty and Aditya Mahajan.
Multi-Armed Bandits, Gittins Index, and its Calculation.
In Methods and Applications of Statistics
in Clinical Trials: Planning, Analysis, and Inferential
Methods,
Volume 2 (ed. N. Balakrishnan), John Wiley & Sons, Inc., Hoboken, NJ,
USA.
Recent survey of approaches to computing the Gittins index.
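As a concrete illustration of one such approach, the index of a Bernoulli arm with a Beta posterior can be approximated by binary search over the retirement reward in Whittle's retirement formulation, with the dynamic program truncated at a finite horizon; a rough sketch (all parameters are illustrative, and the truncation makes the result approximate):

```python
from functools import lru_cache

def gittins_index(a, b, beta=0.9, horizon=60, tol=1e-4):
    """Approximate Gittins index of a Bernoulli arm with a Beta(a, b)
    posterior and discount factor beta, via binary search over the
    retirement reward rate m (Whittle's retirement option)."""

    def advantage(m):
        retire = m / (1.0 - beta)       # value of retiring forever at rate m

        @lru_cache(maxsize=None)
        def V(s, f, depth):
            if depth == horizon:
                return retire           # truncate: retire at the horizon
            p = s / (s + f)             # posterior mean of the arm
            cont = p * (1.0 + beta * V(s + 1, f, depth + 1)) \
                 + (1.0 - p) * beta * V(s, f + 1, depth + 1)
            return max(retire, cont)

        return V(a, b, 0) - retire      # > 0 iff continuing beats retiring

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if advantage(mid) > 0.0:
            lo = mid                    # index exceeds mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The index exceeds the posterior mean (the option to keep learning has value), and an arm with an optimistic posterior gets a larger index than one with a pessimistic posterior.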
Thompson Sampling

2012.
Shipra Agrawal and Navin Goyal.
Analysis of Thompson Sampling for the Multi-armed Bandit Problem.
JMLR W&CP 23:39.1-39.26.
2012.
Emilie Kaufmann, Nathaniel Korda and Rémi Munos.
Thompson Sampling: An Asymptotically Optimal Finite Time Analysis.
arXiv:1205.4217 [stat.ML].
2012.
Shipra Agrawal and Navin Goyal.
Further Optimal Regret Bounds for Thompson Sampling.
arXiv:1209.3353 [cs.LG].
2013.
Nathaniel Korda, Emilie Kaufmann, and Rémi Munos.
Thompson Sampling for 1-Dimensional Exponential Family Bandits.
arXiv:1307.3400 [stat.ML].
A series of papers that analyze the regret of Thompson sampling.
The first shows that, for Bernoulli rewards, Thompson sampling with a
uniform prior has (frequentist, distributiondependent)
regret that is logarithmic in the number of
rounds. The second gives a more refined analysis, with the right
leading constant (matching the Lai and Robbins lower bound).
The third (which we looked at in class) gives an analysis that extends
to more general settings.
The fourth considers exponential family reward distributions,
and Jeffreys prior.
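Thompson sampling itself is simple to state; a minimal sketch for Bernoulli rewards with a uniform prior, the setting of the first paper (the `pull` interface is an illustrative assumption):

```python
import random

def thompson_bernoulli(pull, n_arms, horizon, rng=random):
    """Thompson sampling for Bernoulli rewards with a uniform Beta(1, 1)
    prior; `pull(arm)` is an assumed oracle returning True/False."""
    succ = [0] * n_arms
    fail = [0] * n_arms
    for _ in range(horizon):
        # Sample a mean from each posterior; play the arm with the best sample.
        samples = [rng.betavariate(succ[i] + 1, fail[i] + 1)
                   for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])
        if pull(arm):
            succ[arm] += 1
        else:
            fail[arm] += 1
    return succ, fail

# Illustrative usage: three Bernoulli arms.
random.seed(1)
probs = [0.2, 0.5, 0.8]
succ, fail = thompson_bernoulli(lambda i: random.random() < probs[i], 3, 3000)
pulls = [s + f for s, f in zip(succ, fail)]
```

The posterior of the best arm concentrates, so it is sampled (and hence played) with probability approaching one.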

2013.
Daniel Russo and Benjamin Van Roy.
Learning to Optimize Via Posterior Sampling.
arXiv:1301.2609 [cs.LG].
Gives a relationship between Thompson sampling and UCB, and introduces
the eluder dimension as a measure of the complexity of dependent
rewards.

2013.
Aditya Gopalan, Shie Mannor and Yishay Mansour.
Thompson Sampling for Complex Bandit Problems.
arXiv:1311.0466 [stat.ML].
Gives regret bounds for Thompson sampling in more complex situations:
compound actions, structured rewards, and rather general priors,
likelihoods, and observation models.

2010.
Steve Scott.
A modern Bayesian look at the multi-armed bandit.
Appl. Stochastic Models Bus. Ind. 26:639-658.
2011.
Olivier Chapelle and Lihong Li.
An Empirical Evaluation of Thompson Sampling.
Advances in Neural Information Processing Systems 24:2249-2257.
Simulation studies and experimental
results for Thompson sampling.
Adversarial Bandits
Partial monitoring

2001.
Antonio Piccolboni and Christian Schindelhauer.
Discrete prediction games with arbitrary feedback and loss.
COLT 2001.
Introduced a strategy based on the linear loss estimation idea, and
characterized the trivial and impossible problems.

2006.
Nicolò Cesa-Bianchi, Gábor Lugosi, and Gilles Stoltz.
Regret minimization under partial monitoring.
Mathematics of Operations Research
31:562-580.
Proved a better regret rate for the Piccolboni and Schindelhauer
strategy in the adversarial setting, and
exhibited a game with n^{2/3} regret.

2011.
Gábor Bartók, Dávid Pál, and Csaba Szepesvári.
Minimax Regret of Finite Partial-Monitoring Games in
Stochastic Environments.
COLT 2011.
Completed characterization of regret rates for stochastic partial
monitoring decision problems.

2011.
Dean Foster and Alexander Rakhlin.
No Internal Regret via Neighborhood Watch.
Gave a strategy for the locally observable problem
(with an oblivious adversary,
under a non-degenerate, non-duplicate condition).

2012.
Gábor Bartók, Navid Zolghadr, and Csaba Szepesvári.
An adaptive algorithm for finite stochastic partial monitoring.
ICML 2012.
Presented a single strategy that
achieves optimal regret rates for easy and hard stochastic problems.
Contextual bandits

1979.
Michael Woodroofe.
A One-Armed Bandit Problem with a Concomitant Variable.
JASA 74(368): 799-806.
Bayesian setting: one unknown reward distribution with a simple
one-parameter model, and discounted cumulative
reward. The greedy strategy is optimal as the discount factor approaches one.

1991.
Jyotirmoy Sarkar.
One-Armed Bandit Problems with Covariates.
Ann. Statist. 19(4): 1978-2002.
Extends Woodroofe's result to one-parameter exponential family
models.

2005.
C.C. Wang, S.R. Kulkarni and H.V. Poor.
Bandit Problems With Side Observations.
IEEE Transactions on Automatic Control 50(3): 338-355.
Considers asymptotics of parametric contextual bandit problems.
Shows that the Woodroofe message (that the greedy strategy gives
sufficient exploration) is true whenever the context distribution is
fixed and the optimal arm depends on the context for all values of
the parameters.

2002.
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund and Robert
E. Schapire.
The Nonstochastic Multiarmed Bandit Problem.
SIAM J. Comput.
32(1):48-77.
Section 7 considers the adversarial bandit problem with expert advice,
introduces Exp4, gives regret bounds.
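A minimal sketch of Exp4, assuming each expert's advice is a probability vector over the arms (the interfaces and the choice of gamma are illustrative):

```python
import math
import random

def exp4(advice, pull, horizon, gamma=0.1, rng=random):
    """Exp4 sketch: exponential weights over experts, with
    importance-weighted reward estimates. `advice(t)` (a list of
    probability vectors, one per expert) and `pull(arm)` (a reward
    in [0, 1]) are assumed interfaces."""
    n_experts = len(advice(0))
    n_arms = len(advice(0)[0])
    log_w = [0.0] * n_experts
    total = 0.0
    for t in range(horizon):
        xi = advice(t)
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]     # numerically stable weights
        W = sum(w)
        # Mix expert advice with gamma-uniform exploration.
        p = [(1.0 - gamma) * sum(w[e] * xi[e][a] for e in range(n_experts)) / W
             + gamma / n_arms for a in range(n_arms)]
        arm = rng.choices(range(n_arms), weights=p)[0]
        r = pull(arm)
        total += r
        for e in range(n_experts):
            # Importance-weighted estimate of this expert's reward.
            yhat = xi[e][arm] * r / p[arm]
            log_w[e] += gamma * yhat / n_arms
    return total, log_w

# Illustrative usage: expert 0 always plays arm 0, expert 1 always arm 1.
random.seed(2)
reward_probs = [0.2, 0.8]
total, log_w = exp4(lambda t: [[1.0, 0.0], [0.0, 1.0]],
                    lambda a: 1.0 if random.random() < reward_probs[a] else 0.0,
                    2000)
```

The importance-weighted estimates are unbiased, so the weight of the expert recommending the better arm grows at a faster rate.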

1999.
Naoki Abe and Philip M. Long.
Associative reinforcement learning using linear probabilistic concepts.
ICML 1999,
pages 3-11.
2002.
Peter Auer.
Using Confidence Bounds for Exploitation-Exploration Trade-offs.
JMLR
3(Nov):397-422, 2002.
2010.
Lihong Li, Wei Chu, John Langford and Robert E. Schapire.
A Contextual-Bandit Approach to Personalized News Article Recommendation.
WWW'10
pp. 661-670.
2011.
Wei Chu, Lihong Li, Lev Reyzin and Robert E. Schapire.
Contextual Bandits with Linear Payoff Functions.
AIStats 2011
pp. 208-214.
Abe and Long introduced a contextual bandit problem where the expected
reward of each arm is a fixed linear function of the context vector
for that arm, and gave a strategy with optimal regret when the
dimension grows over time.
Auer gave an upper confidence bound strategy with the optimal regret rate
for fixed dimension.
Li et al. and Chu et al.
considered versions with a separate linear function for each
arm, and with both fixed effects (linear for all arms) and random
effects (separate for each arm), and introduced UCB strategies with
sqrt(nd) regret bounds, for n rounds and dimension-d context vectors.
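As a concrete illustration of these linear-payoff UCB strategies, here is a minimal sketch in the spirit of LinUCB with a single shared parameter; the interfaces, the ridge regularization, and alpha are illustrative assumptions, and numpy is assumed available:

```python
import numpy as np

def linucb(features, pull, horizon, alpha=1.0):
    """LinUCB-style sketch: the expected reward of arm a in round t is
    theta . x_{t,a} for an unknown theta. `features(t)` (list of context
    vectors) and `pull(t, arm)` are assumed interfaces."""
    d = len(features(0)[0])
    A = np.eye(d)                       # ridge-regularized design matrix
    b = np.zeros(d)
    chosen = []
    for t in range(horizon):
        xs = [np.asarray(x, dtype=float) for x in features(t)]
        A_inv = np.linalg.inv(A)
        theta = A_inv @ b               # ridge regression estimate
        # Upper confidence bound for each arm's expected reward.
        ucb = [x @ theta + alpha * np.sqrt(x @ A_inv @ x) for x in xs]
        arm = int(np.argmax(ucb))
        x = xs[arm]
        r = pull(t, arm)
        A += np.outer(x, x)
        b += r * x
        chosen.append(arm)
    return chosen

# Illustrative usage: fixed orthogonal contexts, true theta = (0.2, 0.8).
rng = np.random.default_rng(0)
theta_true = np.array([0.2, 0.8])
chosen = linucb(lambda t: [[1.0, 0.0], [0.0, 1.0]],
                lambda t, a: theta_true[a] + rng.normal(0.0, 0.1),
                500)
```

The confidence width sqrt(x' A^{-1} x) shrinks in directions the strategy has already explored, so the suboptimal arm is abandoned after a handful of pulls.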

2011.
Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin and Robert E.
Schapire.
Contextual Bandit Algorithms with Supervised Learning Guarantees.
AIStats 2011
pp. 19-26.
Applied Exp4 to a cover of a comparison class with finite
VC-dimension.

2011.
Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John
Langford, Lev Reyzin and Tong Zhang.
Efficient Optimal Learning for Contextual Bandits.
UAI 2011
pp. 169-178.
2014.
Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong
Li and Robert E. Schapire.
Taming the Monster: A Fast and Simple Algorithm for Contextual
Bandits.
ICML 2014
JMLR W&CP 32(1):1638-1646, 2014.
Dudik et al. introduced a reduction from the contextual bandit problem to
cost-sensitive classification, allowing the use of an oracle that can
minimize empirical risk; the reduction is based on the ellipsoid
algorithm.
Agarwal et al. introduced a simpler reduction, based on gradient descent on
an entropy-regularized empirical regret.
For comparison classes with efficient empirical risk minimization
oracles, these strategies would be efficient.
Linear bandits

2008.
B. Awerbuch and R. Kleinberg.
Online linear optimization and adaptive routing.
Journal of Computer and System Sciences 74(1):97-114.
Introduced the linear bandits problem, and the idea of a barycentric
spanner.

2008.
Varsha Dani, Thomas P. Hayes, and Sham M. Kakade.
The Price of Bandit Information for Online Optimization.
NIPS 2007.
Used a barycentric spanner exploration distribution and exponential
weights, showed regret O(d sqrt(dn)). Gave sqrt(dn) lower bounds.

2009.
Nicolò Cesa-Bianchi and Gábor Lugosi.
Combinatorial Bandits.
COLT 2009.
Used a uniform exploration distribution and exponential
weights, and showed that the regret is O(d sqrt(n)) when the exploration
covariance matrix has smallest nonzero eigenvalue Omega(1/d) (and gave
important examples where this occurs).

2012.
Sébastien Bubeck, Nicolò Cesa-Bianchi, and Sham M. Kakade.
Towards Minimax Policies for Online Linear Optimization with Bandit
Feedback.
COLT 2012.
Applied John's theorem to show that a suitable exploration distribution
always exists.

2012.
Jacob Abernethy, Elad Hazan and Alexander Rakhlin.
InteriorPoint Methods for FullInformation and Bandit Online
Learning.
IEEE Transactions on Information Theory 58(7):4164-4175, 2012.
The first general efficient linear bandit strategy with a sqrt(T)
regret rate. Uses mirror descent with a self-concordant barrier
regularizer.

2011.
Elad Hazan and Satyen Kale.
Better Algorithms for Benign Bandits.
JMLR 12:1287-1311, 2011.
An improvement that uses the techniques of Abernethy et al. together
with reservoir sampling, and achieves a sqrt(Q) regret rate, where Q, the
quadratic variation of the costs, plays the role of the time T.
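The reservoir sampling ingredient is classical; a minimal sketch of Algorithm R, which maintains a uniform sample from a stream in one pass (names are illustrative):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Algorithm R: one pass over a stream of unknown length, maintaining
    a uniform random sample of k of the items seen so far."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(i + 1)    # item survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Illustrative usage.
random.seed(3)
sample = reservoir_sample(range(1000), 10)
```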

2012.
Sébastien Bubeck, Nicolò CesaBianchi and
Sham M. Kakade.
Towards Minimax Policies for Online Linear Optimization with Bandit
Feedback.
COLT 2012.
The online stochastic mirror descent strategy for linear bandits
on a Euclidean ball, as discussed in class.