CS 294 / Stat 260, Fall 2014:

Learning in Sequential Decision Problems


Lectures:  Evans 334. Tuesday/Thursday 2:00-3:30.

Instructor:

Peter Bartlett (bartlett at cs)

Office Hours:
Mon 1:00-2:00, 723 Sutardja-Dai Hall.
Thu 1:00-2:00, 399 Evans Hall.

Course description

This course will focus on the design and theoretical analysis of learning methods for sequential decision-making under uncertainty. Sequential decision problems involve a trade-off between exploitation (optimizing performance based on the information at hand) and exploration (gathering more information). These problems arise in many important domains, ranging from clinical trials, through computer network optimization and adaptive packet routing, to website and page content optimization, marketing campaign and internet advertising optimization, and revenue management.
Topics covered will include a selection from the following list. Stochastic and game-theoretic formulations of sequential decision problems: multi-armed bandits; linear, convex, and Lipschitz bandits; large-scale (combinatorial) bandit problems; contextual bandits; Markov decision processes; approximate linear programming approaches to controlling MDPs; tools for finite-sample regret analysis.

Prerequisites: Probability theory or statistics (at the level of Stat 205A and 210A). Some previous exposure to algorithms, game theory, linear algebra, and convex optimization will be helpful.

Syllabus

Assessment

The assessment will have two components: presentation of a paper in class and participation in the discussion of these papers (30% = 20% presentation + 10% participation), and a final project (70%).

For paper presentations, you will need to choose a slot (using the Doodle poll) and choose a paper from the list of available papers. Please email me your choice. (Note that there will be other papers on the list later; I'll post an announcement when the list is updated.) The presentations will take place at the usual lecture time and place. For the presentation, you will need to present the main contributions of the paper and lead the discussion about it. We'll have 20 minutes in total: aim to have material that would take around 10 minutes to present without interruptions, and we'll allow about another 10 minutes for discussion, both during and afterwards. Since it's a short time slot, it's probably best to use a laptop and the projector. Don't copy the paper onto your slides. Don't feel obliged to present or discuss everything in the paper. Do aim to include a critique of what the paper does, and questions or open problems that emerge.

The final project can be in any area related to the topics of the course. You might extend a theoretical result, develop a new method and investigate its performance, run experiments on an existing method for a particular application, or do a combination of these. You will need to submit a written report and give a presentation in class. It is OK to work on projects in groups of two (please email me an explanation if there's a good reason to work in a larger group). In all cases you will need to write the report individually. Project proposals are due on September 30 (please send one or two plain text paragraphs in an email message to bartlett at cs). Project reports are due on December 5. Please email a PDF file to bartlett at cs.

Readings

(The readings list is being constantly updated.)

Announcements


Lectures


Thu, Aug 28 Organizational issues. Course outline. Stochastic bandits.
Syllabus
Stochastic bandits
Tue, Sep 2 Regret lower bounds for stochastic bandits.
Lecture will stop at 2:30, to allow the class to attend Tze Leung Lai's talk (2:30-3:30, 3113 Etcheverry Hall).
Lower bounds
Tze Leung Lai talk details
Thu, Sep 4 Regret lower bounds.
Robbins' strategy.
Robbins' strategy
See also: (Robbins, 1952).
Tue, Sep 9 Regret upper bounds:
Concentration inequalities.
UCB.
UCB notes
See also: Section 2.2 of (Bubeck and Cesa-Bianchi, 2012); (Agrawal, 1995); (Auer, Cesa-Bianchi and Fischer, 2002); Sections 3 and 4 of (Lai and Robbins, 1985).
Thu, Sep 11 KL-UCB.
KL-UCB notes
See also: (Cappé, Garivier, Maillard, Munos, Stoltz, 2013).
Tue, Sep 16 More KL-UCB.
Updated KL-UCB notes
Thu, Sep 18 Gittins index.
Gittins notes
See also: (Weber, 1992).
Tue, Sep 23 Thompson sampling.
Thompson notes
See also: (Agrawal and Goyal, 2012).
Thu, Sep 25 Minimax regret.
Minimax notes
See also: Section 5 and Appendix A of (Auer et al, 2002).
Tue, Sep 30 Adversarial bandits.
Adversarial bandits
See also: Section 3 of (Auer et al, 2002).
Thu, Oct 2 Partial information games.
Partial monitoring
See also: (Piccolboni and Schindelhauer, 2001), (Cesa-Bianchi et al, 2006), (Bartók et al, 2011).
Tue, Oct 7 Contextual bandits.
Contextual bandits
See also: (Woodroofe, 1979), (Sarkar, 1991), Section 7 of (Auer et al, 2002).
Thu, Oct 9 Contextual bandits: Infinite comparison classes. epsilon-covers.
Contextual bandits, epsilon-covers
See also: (Beygelzimer et al, 2011).
Discussion paper: Kaufmann et al, 2012. Bayes UCB: On Bayesian Upper Confidence Bounds for Bandit Problems.
Tue, Oct 14 Contextual bandits: reduction to classification.
Contextual bandits, Reduction to classification
See also: (Agarwal et al, 2014).
Discussion papers: György et al, 2007. The on-line shortest path problem under partial monitoring.
Agrawal and Goyal, 2012. Thompson Sampling for Contextual Bandits with Linear Payoffs.
Thu, Oct 16 Linear bandits.
Linear bandits
See also: (Awerbuch and Kleinberg, 2008), (Dani et al, 2008).
Discussion paper: Chapelle and Li, 2011. An Empirical Evaluation of Thompson Sampling.
Tue, Oct 21 Linear bandits: exponential weights.
More linear bandits
See also: (Cesa-Bianchi and Lugosi, 2009), (Bubeck et al, 2012).
Discussion paper: Bertsimas and Niño-Mora, 2000. Restless bandits, linear programming relaxations, and a primal-dual index heuristic.
Thu, Oct 23 Linear bandits: lower bounds.
Still more linear bandits
See also: (Dani et al, 2008).
Discussion papers: Bartók, 2013. A near-optimal algorithm for finite partial-monitoring games against adversarial opponents.
Korda et al, 2013. Thompson Sampling for 1-Dimensional Exponential Family Bandits.
Audibert et al, 2009. Exploration-exploitation trade-off using variance estimates in multi-armed bandits.
Tue, Oct 28 Linear bandits: stochastic mirror descent.
Mirror descent
See also: Chapter 5 of Bubeck and Cesa-Bianchi, 2012.
Discussion papers: Gopalan et al, 2013. Thompson Sampling for Complex Bandit Problems.
Audibert et al, 2012. Regret in Online Combinatorial Optimization.
Thu, Oct 30 Linear bandits: stochastic mirror descent.
See also: Bubeck et al, 2012.
Discussion papers: Russo and Van Roy, 2013. Learning to Optimize Via Posterior Sampling.
Badanidiyuru et al, 2014. Resourceful Contextual Bandits.
Tue, Nov 4 Markov decision processes.
MDPs
See also: slides based on Bertsekas, 2005.
Discussion paper: Mannor and Shamir, 2011. From Bandits to Experts: On the Value of Side-Observations.
Thu, Nov 6 More Markov decision processes.
More MDPs
Discussion papers: Salomon and Audibert, 2011. Deviations of stochastic bandit regret.
Slivkins, 2009. Contextual Bandits with Similarity Information.
Tue, Nov 11 Veterans Day
Thu, Nov 13 Approximate methods for MDPs.
Approximate methods
See also: Chapter 6 of Bertsekas, 2012.
Discussion papers: Hazan and Kale, 2011. Better Algorithms for Benign Bandits.
Abernethy et al, 2012. Interior-Point Methods for Full-Information and Bandit Online Learning.
Tue, Nov 18 Guest lecture:
Nikos Vlassis
On the Computational Complexity of Stochastic Controller Optimization in POMDPs.

Discussion papers: Filippi et al, 2010. Parametric Bandits: The Generalized Linear Case.
Bubeck et al, 2013. Bounded regret in stochastic multi-armed bandits.
Tue, Nov 25 Guest lecture:
Mohammad Ghavamzadeh.
Finite-Sample Analysis of Approximate DP Algorithms.
See also: ICML 2012 Tutorial: Statistical Learning Theory in Reinforcement Learning and Approximate Dynamic Programming.
Thu, Nov 27 Thanksgiving
Tue, Dec 2 Final project presentations: CS 294 students.
Max and Walid; Dylan; Ming and Yuxun; Michael and Jeff; Ke; James.
Thu, Dec 4 Final project presentations: Stat 260 students.
Animesh; Andres and Yonatan; Kieren; Soeren and Soumendu; Jiung and Siyuan; Yannik; Auyon and Birce.