Statistics 212A
Information Theory deals with a basic challenge in communication: How do we transmit information efficiently? In addressing that issue, Information Theorists have created a rich mathematical framework to describe communication processes with tools to characterize so-called fundamental limits of data compression and transmission.

What might Statisticians learn from Information Theory? Basic concepts like entropy and Kullback-Leibler divergence have certainly played a role in statistics. But so too have estimation frameworks like the Maximum Entropy principle; novel decompositions like ICA; and even model selection methodologies like AIC and the Principle of Minimum Description Length. In this course we will illustrate how the basic questions and tools of Information Theory relate to statistical practice and theory.

Tentative Schedule   First Week (8/30, 9/1): Introduction. History of entropy. Overview of the course. Codes and decoding. Examples of real codes.

Reading: C. Shannon (1948). A mathematical theory of communication.

Second Week (9/6, 8): Codes and probability distributions: Kraft's inequality. Shannon's optimal source coding: entropy as the lower bound. Shannon code and Huffman code. Asymptotic Equipartition Property (AEP).
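
To make the coding ideas concrete, here is a minimal Python sketch (on a made-up four-symbol source; not part of the course materials) that builds a Huffman code, checks Kraft's inequality, and compares the average code length with the entropy lower bound:

import heapq
from math import log2

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}  # made-up source distribution

# Huffman's algorithm: repeatedly merge the two least probable subtrees.
heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
heapq.heapify(heap)
next_id = len(heap)
while len(heap) > 1:
    p1, _, c1 = heapq.heappop(heap)
    p2, _, c2 = heapq.heappop(heap)
    merged = {s: "0" + w for s, w in c1.items()}
    merged.update({s: "1" + w for s, w in c2.items()})
    heapq.heappush(heap, (p1 + p2, next_id, merged))
    next_id += 1
code = heap[0][2]

entropy = -sum(p * log2(p) for p in probs.values())
avg_len = sum(probs[s] * len(w) for s, w in code.items())
print(code)
print("Kraft sum:", sum(2.0 ** -len(w) for w in code.values()))  # <= 1 for any prefix code
print("entropy:", entropy, "average length:", avg_len)           # H <= average length < H + 1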

Third Week (9/13, 15): Properties of entropy. Differential entropy. Entropy rate of a stationary process. Plug-in entropy estimate and its limiting distribution. Examples: bioinformatics and neuroscience (Strong's estimate). Fano's inequality and its implications for statistical estimation.
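
As a concrete illustration of the plug-in estimate, a small Python sketch on simulated data (the Miller-Madow term shown is one common bias correction; the in-class treatment may differ):

from collections import Counter
from math import log, log2
import random

def plugin_entropy(sample, miller_madow=False):
    """Plug-in entropy estimate (in bits) from an i.i.d. sample."""
    n = len(sample)
    counts = Counter(sample)
    h = -sum((c / n) * log2(c / n) for c in counts.values())
    if miller_madow:
        # Miller-Madow bias correction: (m - 1) / (2n) nats, converted to bits.
        h += (len(counts) - 1) / (2 * n * log(2))
    return h

random.seed(0)
sample = random.choices("abcd", weights=[8, 4, 2, 2], k=2000)  # true entropy = 1.75 bits
print(plugin_entropy(sample), plugin_entropy(sample, miller_madow=True))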

Fourth Week (9/20, 22): Maximum entropy principle. Redundancy. KL divergence with applications to document clustering via MDS.
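
A hedged sketch of the clustering idea, assuming symmetrized KL divergences between made-up term-frequency vectors and scikit-learn's MDS for the embedding (illustrative only):

import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(3)
docs = rng.dirichlet(np.ones(50), size=6)      # 6 fake per-document term distributions

def kl(p, q):
    return np.sum(p * np.log(p / q))

d = np.array([[kl(p, q) + kl(q, p) for q in docs] for p in docs])   # symmetrized KL matrix
coords = MDS(n_components=2, dissimilarity="precomputed").fit_transform(d)
print(coords)      # 2-D layout; nearby points have similar term distributions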

Fifth Week (9/27, 29): Relating KL to proper distances: L1, L2, and Hellinger. MLE when the model is correctly specified and when it is misspecified.
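
One standard way KL controls the L1 distance is Pinsker's inequality, ||P - Q||_1^2 <= 2 KL(P||Q) with KL in nats; a small numerical illustration (not a proof) on random discrete distributions:

import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    p = rng.dirichlet(np.ones(10))
    q = rng.dirichlet(np.ones(10))
    kl = np.sum(p * np.log(p / q))     # KL(P || Q) in nats
    l1 = np.sum(np.abs(p - q))         # L1 distance
    assert l1 ** 2 <= 2 * kl           # Pinsker's inequality
    print(f"L1^2 = {l1**2:.4f} <= 2*KL = {2*kl:.4f}")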

Sixth week (10/4, 6): Mutual information and sufficiency. Fano's inequality.
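
For reference, a tiny sketch computing mutual information I(X;Y) = sum_{x,y} p(x,y) log p(x,y)/(p(x)p(y)) from a hypothetical joint distribution:

import numpy as np

pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])            # hypothetical joint distribution p(x, y)
px = pxy.sum(axis=1, keepdims=True)     # marginal of X
py = pxy.sum(axis=0, keepdims=True)     # marginal of Y
mi = np.sum(pxy * np.log2(pxy / (px * py)))   # mutual information in bits
print(mi)   # > 0 unless X and Y are independent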

Seventh week (10/11, 13): Minimax density estimation via KL. Shannon's channel capacity.
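
As a simple capacity example, using the standard formula C = 1 - H(p) for the binary symmetric channel (illustrative only):

from math import log2

def bsc_capacity(p):
    """Capacity (bits per use) of a binary symmetric channel with crossover probability p."""
    if p in (0.0, 1.0):
        return 1.0
    h = -p * log2(p) - (1 - p) * log2(1 - p)   # binary entropy H(p)
    return 1 - h

print(bsc_capacity(0.11))   # roughly 0.5 bits per channel use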

Eighth week (10/18, 20): Large deviations. Stein's lemma. I-projection.

Ninth week (10/25, 27): 25: No class (Bin will be traveling); make-up at the end of the semester. 27: Maximum entropy (ME) as I-projection; fitting ME distributions via iterative scaling.
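
A minimal sketch of iterative scaling in its simplest form, iterative proportional fitting: the I-projection of a base distribution onto the set of joint distributions with prescribed (made-up) row and column marginals:

import numpy as np

base = np.ones((3, 3)) / 9               # starting (uniform) joint distribution
row_target = np.array([0.2, 0.3, 0.5])   # hypothetical marginal constraints
col_target = np.array([0.4, 0.4, 0.2])

q = base.copy()
for _ in range(100):
    q *= (row_target / q.sum(axis=1))[:, None]   # rescale to match row marginals
    q *= (col_target / q.sum(axis=0))[None, :]   # rescale to match column marginals

print(q.sum(axis=1), q.sum(axis=0))   # close to the target marginals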

Tenth week (11/1, 3): Independent Component Analysis (ICA) through I-projection. Kolmogorov complexity (Ambuj and Joel).
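
A small illustration of ICA in action, using scikit-learn's FastICA on synthetic mixtures (the course develops ICA via I-projection; this is only a demonstration):

import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(4)
t = np.linspace(0, 10, 2000)
s = np.c_[np.sin(3 * t), np.sign(np.sin(7 * t))]   # two independent source signals
A = np.array([[1.0, 0.5], [0.7, 1.0]])             # hypothetical mixing matrix
x = s @ A.T                                        # observed mixtures

ica = FastICA(n_components=2, random_state=0)
s_hat = ica.fit_transform(x)          # recovered sources (up to order and scale)
print(np.round(ica.mixing_, 2))       # estimated mixing matrix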

Eleventh week (11/8, 10): Minimum Description Length (MDL) principle. Optimal coding of integers. Model selection problem and earlier approaches: AIC, Cp and BIC. Validity of a description length (lower bounds) in MDL.
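
As one example of a universal code for the positive integers, a sketch of the Elias gamma code, which spends about 2*floor(log2 n) + 1 bits on n (illustrative only):

def elias_gamma(n):
    """Elias gamma codeword (as a bit string) for a positive integer n."""
    assert n >= 1
    binary = bin(n)[2:]                      # binary expansion, leading 1 included
    return "0" * (len(binary) - 1) + binary  # unary length prefix, then the expansion

for n in [1, 2, 5, 17]:
    print(n, elias_gamma(n), len(elias_gamma(n)), "bits")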

Twelfth week (11/15, 17): Different MDL forms: two-stage, predictive, mixture, and NML. Model selection in regression.
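
A rough sketch of a two-stage-style description-length criterion, (n/2) log(RSS/n) + (k/2) log n, used to choose the order of a polynomial regression on synthetic data (the exact forms discussed in class may differ):

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 100)
y = 1 + 2 * x - 3 * x**2 + rng.normal(0, 0.3, size=x.size)   # synthetic quadratic data

n = x.size
scores = {}
for k in range(1, 8):                          # k = number of regression coefficients
    coef = np.polyfit(x, y, deg=k - 1)
    rss = np.sum((y - np.polyval(coef, x)) ** 2)
    scores[k] = 0.5 * n * np.log(rss / n) + 0.5 * k * np.log(n)

print(min(scores, key=scores.get))             # typically selects k = 3 here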

Thirteenth week (11/22, 24): 22: Coding predictors and responses in MDL for simultaneous clustering and prediction. 24: Thanksgiving holiday.

Fourteenth week (11/29, 12/1): The Lasso approach for model selection and boosting.
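
A hedged sketch of Lasso-based variable selection using scikit-learn's LassoCV on synthetic data (the in-class treatment may use different tools):

import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]                  # only 3 truly relevant predictors
y = X @ beta + rng.normal(0, 1.0, size=n)

fit = LassoCV(cv=5).fit(X, y)                # cross-validated choice of penalty
selected = np.flatnonzero(fit.coef_)         # indices with nonzero coefficients
print(fit.alpha_, selected)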

Fifteenth week (12/6, 8): Lossless Ziv-Lempel data compression and entropy estimation. Variable-length Markov chain models.
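
A sketch of entropy-rate estimation via Lempel-Ziv (LZ78) incremental parsing: c log2(c) / n, with c the number of parsed phrases, converges to the entropy rate for stationary ergodic sources. The fair-coin example below is illustrative only; convergence is slow, so the estimate is still an overestimate at this sample size.

from math import log2
import random

def lz78_entropy_rate(seq):
    """LZ78 parsing-based entropy-rate estimate, in bits per symbol."""
    phrases, current = set(), ""
    for symbol in seq:
        current += symbol
        if current not in phrases:      # a new phrase ends here
            phrases.add(current)
            current = ""
    c = len(phrases) + (1 if current else 0)
    return c * log2(c) / len(seq)

random.seed(0)
seq = "".join(random.choice("01") for _ in range(100000))   # fair-coin source, H = 1 bit/symbol
print(lz78_entropy_rate(seq))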

Sixteenth week (12/12, Monday): Make-up class (1.5 hours): class project presentations.



Instructor    Professor Bin Yu
409 Evans Hall
University of California, Berkeley
binyu|@|stat.berkeley.edu

Office Hours    TBA
409 Evans Hall


Meeting    T/Th 11:00-12:30
330 Evans Hall


Grading   
30% Class participation and homework assignments
40% Final project and in-class presentation
30% Oral final exam

Textbooks    Thomas Cover and Joy Thomas
Elements of Information Theory

Imre Csiszár and Paul Shields
Information Theory and Statistics: A Tutorial

Jorma Rissanen
Stochastic Complexity in Statistical Inquiry