The materials for lectures 1-3 are mainly from Statistics (3rd ed.) by Freedman, Pisani and Purves. Publisher: Norton

 

Midterm II [download], due at 5pm, Friday 05/09/2008

Data [clustering] [TrueC.dat]

Please email your work to me, drop it off at my office, or put a hard copy in my mailbox at 367 Evans Hall. Good luck!

 

HW and LAB assignments:

HW #1 [download], due at the end of the lab session, Wednesday 02/13/2008

 [Solution]

HW #2 – part 1 [download], due at the end of the lab session, Wednesday 02/27/2008

                part 2 [download], due at the end of the lab session, Wednesday 02/27/2008

    [Solution]

LAB#1 – part 1 [download], due at the end of the lab session, Wednesday 03/12/2008

  [123nt.dat] [purine100.txt] [purine1000.txt] [purine10000.txt] [gtattg_5kb_count.txt]

  [gtattg_20kb_count.txt] [tataat_5kb_count.txt] [tataat_20kb_count.txt]

               

  part 2 [download] [hmm.txt], due 5:00pm Friday 03/21/2008

 

HW #3 – [download], due in class, Thursday 03/13/2008

Data [f8i2.fasta] [f8i2_perm.fasta]

   [Solution]

 

HW #4 – [download] due at the end of the lab session, Wednesday 04/23/2008

 

LAB#2 – [download] due at the end of the lab session, Wednesday 04/30/2008

                Data [ko8.lab]

 

 

Midterm I [download], due at the beginning of class, Thursday 03/06/2008

Note that there is no class on Tuesday 03/04/2008!

 

Note that there is no lecture on Tuesday 04/01/2008!

Please read the following materials: [1] and [PDF] for a review; [PDF] and [PDF] for False Discovery Rate

 

The Thursday lecture will cover Hierarchical and K-means clustering analysis

[PDF]; [PDF]; [PDF];

 


 

 

Lecture notes:

Jan 22 Introduction to the course and Bioinformatics [syllabus] [doc]

Jan 24 Introduction to Probability [doc] [doc]

Jan 29 Introduction to Probability and Statistics (I) [PDF]; More on distributions [PDF]

Jan 31 Introduction to Probability and Statistics (II) [PDF]; Mendel’s genetics [doc]

 

Feb 5, Feb 7, Feb 12: Statistics in Genome Assembly

Genome Sequencing [PDF];

Statistics in shotgun sequencing [PDF];

Some reading materials:

  History of Genomic Sequencing [1], [2]

Re-sequencing [PDF]

Brief introduction to biochemistry in sequencing [PDF]

 

The materials of “Statistics in shotgun sequencing” are from

(Hard copies of the above two sections will be distributed in class)

 

Outline of the lecture on Feb 12:

1) Analysis of one sequence

   a. Shotgun sequencing

      i. Coverage theorem

      ii. Mean number of contigs

   b. Modeling signals in DNA (brief intro)

      i. Identification of transcription factor binding sites

      ii. Hidden Markov Model (HMM) for gene annotation (coding regions vs. noncoding regions)

2) Analysis of multiple sequences

   a. Sequence alignment (brief intro)

   b. Alignment-free sequence comparison

      i. Frequency comparison (Chi-square test)

      ii. D2 statistic
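For items 1(a) and 2(b)(ii) of the outline, here is a minimal sketch in Python. It assumes that the "coverage theorem" refers to the idealized Lander-Waterman model (reads placed uniformly at random on the genome) and that the D2 statistic is the number of shared k-word matches between two sequences. The function and parameter names are illustrative and are not taken from the course materials.

```python
import math
from collections import Counter


def shotgun_stats(genome_len, read_len, n_reads, min_overlap_frac=0.0):
    """Idealized Lander-Waterman summaries for shotgun sequencing."""
    # Coverage c = N * L / G (mean number of reads covering a base).
    c = n_reads * read_len / genome_len
    # Expected fraction of the genome covered by at least one read.
    frac_covered = 1.0 - math.exp(-c)
    # Expected number of contigs. min_overlap_frac is the fraction of a
    # read that must overlap before two reads are joined (0 ignores it).
    n_contigs = n_reads * math.exp(-c * (1.0 - min_overlap_frac))
    return c, frac_covered, n_contigs


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """D2 statistic: sum over k-words of count_a(word) * count_b(word)."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    return sum(n * counts_b[word] for word, n in counts_a.items())
```

For example, `shotgun_stats(1_000_000, 500, 10_000)` gives a coverage of 5, and `d2("ACGT", "ACGA", 2)` returns 2 because the two sequences share the words AC and CG once each.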

 

Feb 12, Feb 14, Feb 19, Feb 21, Feb 26, Feb 28: DNA or protein sequence alignment and comparison

Markov Chain and Hidden Markov Model; [reference link]

Viterbi algorithm [link1] [link2] and Dynamic programming [PDF]

(Please also read Chapter 4 and Chapter 11 of Ewens and Grant’s book “Statistical Methods in Bioinformatics: An Introduction”. Hard copies of some sections from those chapters will be distributed in class on Feb 19.)
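As a companion to the Viterbi and dynamic programming readings, here is a minimal sketch of the Viterbi algorithm in Python, computed in log space. The two-state model (N for noncoding, C for coding) and every probability in it are made-up values chosen only for illustration; they do not come from the course materials.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path and its log-probability."""
    # best[t][s]: log-probability of the best path ending in state s at time t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
             for s in states}]
    back = []  # back[t][s]: best predecessor of state s at time t + 1
    for x in obs[1:]:
        scores, ptr = {}, {}
        for s in states:
            prev = max(states,
                       key=lambda r: best[-1][r] + math.log(trans_p[r][s]))
            ptr[s] = prev
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][x]))
        best.append(scores)
        back.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical toy model: noncoding (N) is AT-rich, coding (C) is GC-rich.
states = ["N", "C"]
start_p = {"N": 0.5, "C": 0.5}
trans_p = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
emit_p = {
    "N": {"A": 0.49, "T": 0.49, "G": 0.01, "C": 0.01},
    "C": {"A": 0.01, "T": 0.01, "G": 0.49, "C": 0.49},
}

path, log_prob = viterbi("AAAGGG", states, start_p, trans_p, emit_p)
```

With these toy parameters, the sequence AAAGGG is decoded as NNNCCC. Working in log space avoids numerical underflow on long sequences.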

 

Random Walk and BLAST theory; [PDF]

Maximum Likelihood Estimates;

 

Mar 4 Midterm review

Mar 6 Midterm I

 

Mar 11, 13, 18, 20: Gene Expression Data Analysis (I)

SAGE data [PDF]; please read the introduction and the method sections

Reference [2]

 

Please watch the animation [1] for DNA arrays (highly recommended)

“Affymetrix arrays” [PDF]

 

 “Statistics in Microarray gene expression analysis” [PDF]

“Microarray gene expression analysis (I)” [PDF]

“Statistical methods for finding differentially expressed genes” [PDF]

“False Discovery Rates, Maximum Likelihood Estimates, Bayes’s Rule” [PDF]
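One common way to control the False Discovery Rate is the Benjamini-Hochberg step-up procedure. The sketch below assumes that this is the procedure of interest; the readings above may present a different variant. The function name and the default level of 0.05 are illustrative.

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(pvals)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k with p_(k) <= k * alpha / m.
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    # Reject the k smallest p-values (none if k == 0).
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected
```

Because the procedure is step-up, a small p-value can be rejected even if it misses its own threshold, as long as a larger p-value further down the sorted list meets its threshold.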

 

Outline of the lecture on March 18:

Two-sample t-tests for finding differentially expressed genes

1. Calculating the test statistics

   a. Samples with equal or unequal sizes

   b. Samples with equal or unequal variances

2. Evaluating the statistics

   a. Type I errors; false positives; p-values; false positive rate

   b. Type II errors; false negatives; power; false negative rate

3. Q-Q plots

   a. Comparing the distributions of two samples

   b. Comparing the distribution of one sample against a particular distribution

   c. Visualizing outliers (significant genes)
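For item 1 of the outline, here is a minimal sketch in Python of the two-sample t statistic. It covers the pooled-variance form (equal variances assumed) and Welch's form with the Welch-Satterthwaite degrees of freedom (unequal variances). The function name and the `equal_var` flag are illustrative choices, not the notation used in the lecture.

```python
import math
from statistics import mean, variance


def two_sample_t(x, y, equal_var=False):
    """Return (t statistic, degrees of freedom) for samples x and y."""
    nx, ny = len(x), len(y)          # sample sizes may differ
    vx, vy = variance(x), variance(y)  # sample variances (n - 1 denominator)
    if equal_var:
        # Pooled estimate of the common variance.
        sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
        se = math.sqrt(sp2 * (1.0 / nx + 1.0 / ny))
        df = nx + ny - 2
    else:
        # Welch's t-test: separate variances, approximate degrees of freedom.
        a, b = vx / nx, vy / ny
        se = math.sqrt(a + b)
        df = (a + b) ** 2 / (a ** 2 / (nx - 1) + b ** 2 / (ny - 1))
    return (mean(x) - mean(y)) / se, df
```

For x = [1, 2, 3, 4, 5] and y = [2, 3, 4, 5, 6], both forms give t = -1 with 8 degrees of freedom, since the sample sizes and variances are equal. The p-value would then come from the t distribution with `df` degrees of freedom, which is not computed here.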

 

“Microarray gene expression analysis (II)” [PDF] (optional)

More reading materials for False Discovery Rates [PDF] (optional)

 

Mar 25, Mar 27  Spring break

 

Apr 1, 3, 8, 10, 15, 17, 22, 24: Gene Expression Data Analysis (II): clustering and classification


 

The Thursday (04/03) lecture will cover Hierarchical and K-means clustering analysis

[PDF]; [PDF]; [PDF];

 

Hierarchical clustering

K-means clustering

Self-organizing maps

Principal Component Analysis (PCA) [PDF];

Spectral Clustering [PDF];

Correspondence Analysis

Nonnegative Matrix Factorization
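To make one of the methods above concrete, here is a minimal sketch of K-means clustering (Lloyd's algorithm) in plain Python. Starting centers are passed in explicitly so the result is reproducible; in practice they are usually chosen at random. The function names and the `n_iter` cap are illustrative.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centers, n_iter=100):
    """Run Lloyd's algorithm; return (labels, centers). Requires n_iter >= 1."""
    centers = [tuple(c) for c in centers]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(len(centers)),
                      key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return labels, centers
```

For example, with points (0, 0), (0, 1), (10, 10), (10, 11) and starting centers (0, 0) and (10, 10), the labels are [0, 0, 1, 1] and the final centers are (0, 0.5) and (10, 10.5). For gene expression data, each point would be one gene's expression profile across samples.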

 

If time permits, we will also cover methods for finding transcription factor binding motifs.

 

Apr 29 Midterm II review

May 1 Midterm II

 

May 6, 8, 13 Final Presentations