CS 294 / Stat 260, Fall 2014:

Learning in Sequential Decision Problems

Lectures: Evans 334. Tuesday/Thursday 2:00-3:30.

Instructor:

Peter Bartlett (bartlett at cs)

Office Hours:
Mon 1:00-2:00, 723 Sutardja-Dai Hall.
Thu 1:00-2:00, 399 Evans Hall.

Course description

This course will focus on the design and theoretical analysis of learning methods for sequential decision-making under uncertainty. Sequential decision problems involve a trade-off between exploitation (optimizing performance based on the information at hand) and exploration (gathering more information). These problems arise in many important domains, ranging from clinical trials, through computer network optimization and adaptive packet routing, to website and page content optimization, marketing campaign and internet advertising optimization, and revenue management.
Topics covered will include a selection from the following list. Stochastic and game-theoretic formulations of sequential decision problems: multi-armed bandits, linear, convex, and Lipschitz bandits, large-scale (combinatorial) bandit problems, contextual bandits, Markov decision processes, approximate linear programming approaches to controlling MDPs, tools for finite-sample regret analysis.

Prerequisites: Probability theory or statistics (at the level of Stat 205A and 210A). Some previous exposure to algorithms, game theory, linear algebra, and convex optimization will be helpful.

Syllabus

Assessment

The assessment will have two components: presentation of a paper in class and participation in the discussion of these papers (30% = 20% presentation + 10% participation), and a final project (70%).

For paper presentations, you will need to choose a slot (using the Doodle poll) and choose a paper from the list of available papers. Please email me your choice. (Note that there will be other papers on the list later; I'll post an announcement when the list is updated.) The presentations will take place at the usual lecture time and place. For the presentation, you will need to present the main contributions of the paper and lead the discussion about it. We'll have 20 minutes total: aim to prepare material that would take around 10 minutes to present without interruptions, and we'll allow about another 10 minutes for discussion, both during and afterwards. Since it's a short time slot, it's probably best to use a laptop and the projector. Don't copy the paper onto your slides. Don't feel obliged to present or discuss everything in the paper. Do aim to include a critique of what the paper does, and questions or open problems that emerge.

The final project can be in any area related to the topics of the course. You might extend a theoretical result, develop a new method and investigate its performance, or run experiments on an existing method for a particular application, or do a combination of these. You will need to submit a written report and give a presentation in class. It is OK to work on projects in groups of two (please email me an explanation if there's a good reason to work in a larger group). In all cases you will need to write the report individually. Project proposals are due on September 30 (please send one or two plain text paragraphs in an email message to bartlett at cs). Project reports are due on December 5. Please email a pdf file to bartlett at cs.

Readings

(The readings list is being constantly updated.)

Announcements

Tue Nov 18. Project presentations will be in class on Tuesday, December 2 and Thursday, December 4. Each project - both individuals and groups - will have a 10-minute presentation slot, with 60 seconds of set-up time between speakers. (I will stop you after 10 minutes.) If you are enrolled in CS294, you will present your talk in the session on Tuesday, December 2. If you are enrolled in Stat260, you will present it on Thursday, December 4. The presentations are in alphabetical order by surname - see the schedule below. If you cannot present your talk on the assigned day, please arrange to swap with someone in the other session and send me an email describing the swap.
Tue Nov 18. The final project reports are due on Friday, December 5. Everyone must write and submit their own report: whether projects were done individually or in a group, reports are to be done individually. Please email a pdf file to bartlett at cs by midnight on Friday, December 5.

Lectures

Thu, Aug 28	Organizational issues. Course outline. Stochastic bandits.	Syllabus Stochastic bandits
Tue, Sep 2	Regret lower bounds for stochastic bandits. Lecture will stop at 2:30, to allow the class to attend Tze Leung Lai's talk (2:30-3:30, 3113 Etcheverry Hall).	Lower bounds Tze Leung Lai talk details
Thu, Sep 4	Regret lower bounds. Robbins' strategy.	Robbins' strategy See also (Robbins, 1952).
Tue, Sep 9	Regret upper bounds: Concentration inequalities. UCB.	UCB notes See also: Section 2.2 of (Bubeck and Cesa-Bianchi, 2012); (Agrawal, 1995); (Auer, Cesa-Bianchi and Fischer, 2002); Sections 3 and 4 of (Lai and Robbins, 1985).
Thu, Sep 11	KL-UCB.	KL-UCB notes See also: (Cappé, Garivier, Maillard, Munos, Stoltz, 2013).
Tue, Sep 16	More KL-UCB.	Updated KL-UCB notes
Thu, Sep 18	Gittins index.	Gittins notes See also: (Weber, 1992).
Tue, Sep 23	Thompson sampling.	Thompson notes See also: (Agrawal and Goyal, 2012).
Thu, Sep 25	Minimax regret.	Minimax notes See also: Section 5 and Appendix A of (Auer et al, 2002).
Tue, Sep 30	Adversarial bandits.	Adversarial bandits See also: Section 3 of (Auer et al, 2002).
Thu, Oct 2	Partial information games.	Partial monitoring See also: (Piccolboni and Schindelhauer, 2001), (Cesa-Bianchi et al, 2006), (Bartók et al, 2011).
Tue, Oct 7	Contextual bandits.	Contextual bandits See also: (Woodroofe, 1979), (Sarkar, 1991), Section 7 of (Auer et al, 2002).
Thu, Oct 9	Contextual bandits: Infinite comparison classes. epsilon-covers.	Contextual bandits, epsilon-covers See also: (Beygelzimer et al, 2011). Discussion paper: Kaufmann et al, 2012. Bayes UCB: On Bayesian Upper Confidence Bounds for Bandit Problems.
Tue, Oct 14	Contextual bandits: reduction to classification.	Contextual bandits, Reduction to classification See also: (Agarwal et al, 2014). Discussion papers: Gyorgy et al, 2007. The on-line shortest path problem under partial monitoring. Agrawal and Goyal, 2012. Thompson Sampling for Contextual Bandits with Linear Payoffs.
Thu, Oct 16	Linear bandits.	Linear bandits See also: (Awerbuch and Kleinberg, 2008), (Dani et al, 2008) Discussion paper: Chapelle and Li, 2011. An Empirical Evaluation of Thompson Sampling.
Tue, Oct 21	Linear bandits: exponential weights.	More linear bandits See also: (Cesa-Bianchi and Lugosi, 2009), (Bubeck et al, 2012) Discussion paper: Bertsimas and Nino-Mora, 2000. Restless bandits, linear programming relaxations, and a primal-dual index heuristic.
Thu, Oct 23	Linear bandits: lower bounds.	Still more linear bandits See also: (Dani et al, 2008) Discussion papers: Bartók, 2013. A near-optimal algorithm for finite partial-monitoring games against adversarial opponents. Korda et al, 2013. Thompson Sampling for 1-Dimensional Exponential Family Bandits. Audibert et al, 2009. Exploration-exploitation trade-off using variance estimates in multi-armed bandits.
Tue, Oct 28	Linear bandits: stochastic mirror descent.	Mirror descent See also: Chapter 5 of Bubeck and Cesa-Bianchi, 2012. Discussion papers: Gopalan et al, 2013. Thompson Sampling for Complex Bandit Problems. Audibert et al, 2012. Regret in Online Combinatorial Optimization.
Thu, Oct 30	Linear bandits: stochastic mirror descent.	See also: Bubeck et al, 2012. Discussion papers: Russo and Van Roy, 2013. Learning to Optimize Via Posterior Sampling. Badanidiyuru et al, 2014. Resourceful Contextual Bandits.
Tue, Nov 4	Markov decision processes.	MDPs See also Slides based on Bertsekas, 2005. Discussion paper: Mannor and Shamir, 2011. From Bandits to Experts: On the Value of Side-Observations.
Thu, Nov 6	More Markov decision processes.	More MDPs Discussion papers: Salomon and Audibert, 2011. Deviations of stochastic bandit regret. Slivkins, 2009. Contextual Bandits with Similarity Information.
Tue, Nov 11	Veterans Day
Thu, Nov 13	Approximate methods for MDPs.	Approximate methods See also Chapter 6 of Bertsekas, 2012. Discussion papers: Hazan and Kale, 2011. Better Algorithms for Benign Bandits. Abernethy et al, 2012. Interior-Point Methods for Full-Information and Bandit Online Learning.
Tue, Nov 18	Guest lecture: Nikos Vlassis. On the Computational Complexity of Stochastic Controller Optimization in POMDPs.	Discussion papers: Filippi et al, 2010. Parametric Bandits: The Generalized Linear Case. Bubeck et al, 2013. Bounded regret in stochastic multi-armed bandits.
Tue, Nov 25	Guest lecture: Mohammad Ghavamzadeh. Finite-Sample Analysis of Approximate DP Algorithms.	See also: ICML2012 Tutorial: Statistical Learning Theory in Reinforcement Learning and Approximate Dynamic Programming.
Thu, Nov 27	Thanksgiving
Tue, Dec 2	Final project presentations: CS 294 students.	Max and Walid; Dylan; Ming and Yuxun; Michael and Jeff; Ke; James.
Thu, Dec 4	Final project presentations: Stat 260 students.	Animesh; Andres and Yonatan; Kieren; Soeren and Soumendu; Jiung and Siyuan; Yannik; Auyon and Birce.