My primary research interests lie in developing novel statistical approaches in bioinformatics.  Specific topics of current interest are: 1) prediction of functional elements in genome sequences; 2) identification of gene functional relationships, pathways and networks; and 3) translational bioinformatics: analysis of multiple-microarrays and integrative information retrieval for aiding disease diagnosis.

Predicting functional elements in genome sequences

·     We published theoretical studies on the asymptotic distribution of over- or under-represented words and related statistics in DNA sequences in Advances in Applied Probability (2002)1 and Proceedings of the National Academy of Sciences (2002)2. This work contributes to the mathematical foundations of many downstream sequence analyses, e.g., sequence comparison.

·      We published two studies on modeling and identifying fuzzy words (e.g., protein binding motifs) in Journal of Computational Biology (2004, 2005)3,4. The algorithms we developed, based on solid probabilistic principles, received widespread attention.

·     In cooperation with the ENCODE Consortium (an international collaboration to annotate all functional elements in the human genome), I contributed to developing a statistical method for studying the joint distribution of genomic features. Application of the approach to ENCODE data led to several key biological observations. The results were included in the Nature (2007)5 paper. The manuscript on the statistical aspects of the method is under review at Annals of Applied Statistics.

(Method details) Our method is predicated on a novel segmented block bootstrap, together with analytical derivations of variances in discrete stochastic processes. It enables the detection of relationships between various genomic features. Prior to our work, no statistically rigorous approach to this analysis existed in the literature. The approach has been adopted by a number of groups, and we expect it to be widely used in many areas of genomic studies in the near future.
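The idea of resampling a genome in blocks can be sketched as follows. This is only a minimal illustration that uses a plain circular block bootstrap on two binary feature tracks; it is not the segmented procedure itself. The function names, the overlap statistic, and the toy tracks are assumptions made for the example.

```python
import random
from statistics import stdev


def overlap_fraction(a, b):
    """Fraction of positions where both binary feature tracks equal 1."""
    return sum(x & y for x, y in zip(a, b)) / len(a)


def circular_block_bootstrap(a, b, block_len, n_boot, seed=0):
    """Resample paired positions in contiguous blocks (wrapping around the
    end of the sequence) and return the overlap of each bootstrap replicate.

    Sampling blocks rather than single positions keeps the local dependence
    between neighbouring positions inside each block.
    """
    rng = random.Random(seed)
    n = len(a)
    n_blocks = -(-n // block_len)  # ceiling division
    replicates = []
    for _ in range(n_boot):
        idx = []
        for _ in range(n_blocks):
            start = rng.randrange(n)
            idx.extend((start + k) % n for k in range(block_len))
        idx = idx[:n]  # trim to the original length
        replicates.append(
            overlap_fraction([a[i] for i in idx], [b[i] for i in idx])
        )
    return replicates


# Hypothetical toy tracks: 1 marks a position covered by the feature.
track_a = [1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
track_b = [1, 0, 0, 0, 1, 1, 0, 1, 1, 0]
replicates = circular_block_bootstrap(track_a, track_b, block_len=3, n_boot=200)
standard_error = stdev(replicates)  # bootstrap estimate of the overlap's spread
```

The spread of the replicates gives a variance estimate for the overlap statistic that respects short-range dependence along the sequence, which is the role the block bootstrap plays in the method described above.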

·     We developed a simple yet effective approach for genome-wide transcriptome identification by multiple RNA tiling arrays. The approach is novel and biologically insightful.

(Method details) Our generative model carefully elaborates the sources of randomness in multiple tiling arrays. Based on the model, we derived two statistics, monotonic to the model parameters of interest, to identify the transcribed regions. The use of these statistics nicely bypasses the difficulty in fitting the model, e.g., estimating the probe affinity.  We demonstrated the effectiveness of our method using real data. Software packages for Windows and Unix operating systems are available.

Identifying gene functional relationships, pathways and networks

·     We published in Genome Biology (2004)6 the first model-based method for grouping functionally related genes using multiple SAGE libraries of expression data. This method has been widely applied and has led to many important biological and clinical findings. The Cepko Lab at Harvard University, a world-leading lab in retinal development and disease, applied the method to a mouse retina SAGE dataset. This collaborative effort resulted in a publication in PLoS Biology (2004)7. Dr. Polyak, a well-known cancer genomics expert, applied the method to breast cancer data. This joint work was published in Cancer Cell (2004)8 (highlighted in the cover story).

Since publication, our approach has attracted attention and been referenced by researchers around the world. In particular, I was invited to contribute a book chapter on it to Methods in Molecular Biology (2008)9, in addition to an invited article in Chance (2006)10. We later extended and improved the approach by adapting the same model to an appropriate feature space (BMC Bioinformatics, 2007)11. This work was marked as highly accessed two weeks after its publication. A successful application of the method to a maize microarray dataset resulted in a joint publication with the Feldman lab in Plant Molecular Biology (2006)12. An integrated software package for both methods, named GEA, is available.

·     I contributed to two publications, in Nature Biotechnology (2005)13 and in Bioinformatics (2007)14, on inferring gene relationships from cross-platform microarray data. The Nature Biotechnology paper (highlighted in Nature Reviews Genetics) introduces the novel concept of second-order co-expression, which can be used to identify genes of the same function that lack direct co-expression patterns, and to reconstruct regulatory networks.
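The concept can be illustrated with a short sketch. It assumes that second-order co-expression is measured by first computing the correlation of a gene pair within each dataset, and then correlating those per-dataset values between two gene pairs. The function names and the toy datasets are hypothetical, and the published analysis involves steps that are not shown here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def first_order_profile(gene_a, gene_b, datasets):
    """Within-dataset correlation of one gene pair, one value per dataset."""
    return [pearson(d[gene_a], d[gene_b]) for d in datasets]


def second_order_coexpression(pair1, pair2, datasets):
    """Correlation, across datasets, of the first-order profiles of two pairs."""
    return pearson(
        first_order_profile(pair1[0], pair1[1], datasets),
        first_order_profile(pair2[0], pair2[1], datasets),
    )


# Hypothetical toy data: three datasets, each mapping gene name -> expression values.
datasets = [
    {"a": [1, 2, 3], "b": [2, 4, 6], "c": [1, 2, 3], "d": [1, 2, 3]},
    {"a": [1, 2, 3], "b": [3, 2, 1], "c": [1, 2, 3], "d": [6, 4, 2]},
    {"a": [2, 5, 9], "b": [1, 4, 8], "c": [3, 1, 2], "d": [6, 2, 4]},
]
score = second_order_coexpression(("a", "b"), ("c", "d"), datasets)
```

In this toy example the pairs (a, b) and (c, d) are switched on and off together across datasets, so their second-order score is high even though genes a and c, say, need not be directly co-expressed.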

·     We published in Journal of the American Statistical Association (2009)15 a novel framework for estimating gene correlations via controlling experimental dependencies. This study provides a conceptual advance to the analysis of microarray gene expression data.

(Motivation of the study) Microarray data from an increasing number of biologically interrelated experiments now allow for more complete portrayals of functional gene relationships. In current studies of gene relationships, however, the presence of expression dependencies attributable to the biologically interrelated experiments is widely ignored. When unaccounted for, these experiment dependencies can result in inaccurate inferences of functional gene relationships, and hence incorrect biological conclusions.

(Method details) We developed a framework, consisting of a model and an estimation procedure, to infer gene relationships when there are two-way dependencies in the gene expression matrix (the gene-wise and experiment-wise dependencies). The main aspect of the framework is using a Kronecker product covariance matrix to model the gene-experiment interactions. The resulting novel gene co-expression measure, named Knorm correlation, has a smaller estimation variance than the widely used Pearson coefficient. The implementation of the iterative estimation procedure requires some monitoring and control of the quality of the two estimated correlation matrices and their inverses. A follow-up study that extends the model to find pathway genes is ongoing.
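The Kronecker product structure can be made concrete with a small sketch. It shows only how a gene-wise and an experiment-wise covariance matrix combine into the covariance of the whole expression matrix; it does not implement the iterative estimation procedure or the Knorm correlation. The helper name `kron` and the toy matrices are assumptions made for the example.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
        for i in range(len(a))
        for k in range(len(b))
    ]


# Hypothetical toy covariances: 2 genes and 2 experiments.
sigma_gene = [[1.0, 0.5],
              [0.5, 1.0]]
sigma_exp = [[1.0, 0.2],
             [0.2, 1.0]]

# Stack the expression matrix column by column (experiment by experiment).
# Entry (gene g, experiment e) then sits at position e * n_genes + g, and the
# covariance of the stacked vector is sigma_exp (x) sigma_gene.
n_genes = len(sigma_gene)
full_cov = kron(sigma_exp, sigma_gene)


def entry_cov(g1, e1, g2, e2):
    """Covariance between expression entries (g1, e1) and (g2, e2)."""
    return full_cov[e1 * n_genes + g1][e2 * n_genes + g2]


# Under this structure the covariance factorizes into a gene part and an
# experiment part: 0.5 * 0.2 for gene 0 in experiment 0 vs gene 1 in experiment 1.
cross = entry_cov(0, 0, 1, 1)
```

Because every entry of the full covariance is a product of one gene-wise term and one experiment-wise term, dependencies between experiments leak into naive gene-gene correlations unless they are modeled explicitly, which is the motivation for the framework above.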

(Significance of the work) It is the first methodological study for concurrently estimating the gene-wise and experiment-wise dependencies from an expression matrix. The difficulty in constructing and implementing the model comes from the high-dimensional, complex nature of gene expression data: the number of genes is far larger than the number of experiments; only replicates of “expression vectors” (an expression vector corresponds to an array or a column in an expression matrix) are available; and we do not really observe the replicates of “whole expression matrices” (this last point, in particular, has been widely ignored or misunderstood). Our method, with appealing intuitive explanations, is biologically insightful.

Translational Bioinformatics research

This new area of inquiry concerns the analysis of, and reasoning over, the enormous quantity of life science data in public repositories. Below we summarize two of our projects in this direction.

(Motivation of our study) The rapid accumulation of microarray gene expression data has offered unprecedented opportunities to study human diseases. The NCBI Gene Expression Omnibus (GEO) repository is currently the largest database that systematically documents the genome-wide molecular basis of diseases. However, to date, this resource has been far from fully utilized. It could serve as a rich source of information for disease diagnosis, i.e. screening across the enormous number of disease expression datasets holds the promise to narrow down disease candidates in an automated fashion. Such expression-based automated diagnosis would be particularly useful when the potential disease is not obvious or when the disease lacks biochemical diagnostic tests.

We aim to take the first steps toward turning the NCBI GEO repository into an automated disease diagnosis system. Our study provides an important application for the massive public microarray data - potentially a quantum leap ahead of most current diagnosis approaches based on qualitative information.

·    In the BMC Bioinformatics (2009)16 paper, we tested the feasibility of disease classification using the large amount of heterogeneous microarray datasets from NCBI GEO.

(Study details) In this study, we overcame several challenges: 1) To remove cross-platform data incompatibilities, we derived standardized profiles (vectors) whose components reflect the level and direction of differential expression of disease-related genes. Differential expression is an intrinsic characteristic of the disease and hence carries the most stable information regardless of platform or lab differences. About 9000 microarray experiments were included in our study. 2) We mapped the heterogeneous phenotypic text information to concepts in the Unified Medical Language System. This enables us to categorize the thousands of microarray datasets into different disease classes. 3) We designed a classification approach named ManiSVM, which integrates manifold data transformation with SVM learning. Real data analysis showed that ManiSVM is advantageous. The manifold data transformation is critical for effective learning, since the data are very noisy and heterogeneous.
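Step 1 can be illustrated with a short sketch. It assumes that the level and direction of differential expression are summarized by a Welch-type t-statistic computed within a single dataset; the actual standardization used in the paper may differ, and the function name and toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def differential_profile(disease, control):
    """Per-gene differential expression within one dataset.

    disease, control: dict mapping gene name -> list of expression values
    (at least two samples per group). The sign of each component gives the
    direction of change and the magnitude gives its level. The statistic is
    unit-free, so profiles from different platforms share a common scale.
    """
    profile = {}
    for gene in disease.keys() & control.keys():
        d, c = disease[gene], control[gene]
        se = sqrt(variance(d) / len(d) + variance(c) / len(c))
        profile[gene] = (mean(d) - mean(c)) / se if se > 0 else 0.0
    return profile


# Hypothetical toy dataset with two genes and three samples per group.
disease_samples = {"g1": [5, 6, 7], "g2": [1, 2, 3]}
control_samples = {"g1": [1, 2, 3], "g2": [1, 2, 3]}
profile = differential_profile(disease_samples, control_samples)
```

Here g1 receives a positive component (up-regulated in disease) and g2 a component of zero; because raw intensities never leave their own dataset, platform-specific scales do not enter the profile.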

·    In a manuscript presently in preparation, we report the first study to transform the NCBI GEO repository into an automated disease diagnosis database.

(Study details) We developed an approach to robustly diagnose a query expression profile by jointly utilizing the quantitative genomic data and the phenotypic text data. We formulated the question as a hierarchical multi-label classification problem. That is, we aim to categorize a query expression profile into multiple relevant disease classes along a hierarchical disease taxonomy. We developed a two-stage Bayesian learning approach for the problem. The approach first builds independent Bayesian classifiers for each disease class, followed by the integration of individual predictions with a Bayesian network model to allow collaborative error-correction across classes in the hierarchy.
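The second stage can be illustrated on a tiny hypothetical taxonomy. The sketch assumes that each class has an independent stage-one classifier whose output enters as a likelihood, and that a class can be positive only if its parent is positive; posteriors are obtained by brute-force enumeration, which is feasible only for very small hierarchies. The names, priors, and likelihood values are made up for the example, and this is not the model used in the study.

```python
from itertools import product


def corrected_posteriors(parents, prior, likelihood):
    """Posterior P(class = 1 | all classifier outputs) on a class hierarchy.

    parents:    dict class -> parent class (None for a root)
    prior:      dict class -> P(class = 1 | parent = 1); for a root, P(class = 1)
    likelihood: dict class -> (P(output | class = 0), P(output | class = 1)),
                taken from that class's independent stage-one classifier
    """
    classes = list(parents)
    total = 0.0
    marginal = {c: 0.0 for c in classes}
    for labels in product([0, 1], repeat=len(classes)):
        y = dict(zip(classes, labels))
        p = 1.0
        for c in classes:
            parent = parents[c]
            # A class cannot be positive when its parent is negative.
            p_pos = 0.0 if parent is not None and y[parent] == 0 else prior[c]
            p *= p_pos if y[c] == 1 else 1.0 - p_pos
            p *= likelihood[c][y[c]]
        total += p
        for c in classes:
            if y[c] == 1:
                marginal[c] += p
    return {c: marginal[c] / total for c in classes}


# Hypothetical two-level taxonomy: the child classifier is confident,
# while the parent classifier leans slightly negative.
parents = {"cancer": None, "breast cancer": "cancer"}
prior = {"cancer": 0.5, "breast cancer": 0.5}
likelihood = {"cancer": (0.6, 0.4), "breast cancer": (0.1, 0.9)}
posterior = corrected_posteriors(parents, prior, likelihood)
```

Taken alone, the parent classifier would give "cancer" a posterior of 0.4; combining it with the confident child prediction raises it to about 0.77, which illustrates how predictions can correct one another along the hierarchy.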

(Significance of the work) This disease diagnosis problem is much more challenging than the many existing efforts in the literature, where the disease-query problems were usually based only on the gene expression data or on the phenotypic metadata. A sensitive integration of multiple data-types requires a full and careful consideration of the complex data properties as well as the various sources of noise (remember that our study deals with thousands of microarray experiments). Such modeling cannot be achieved by a direct application of any existing prediction or machine learning approaches without massive data processing. Our Bayesian learning approach, which allows interrogating the genomic and phenotypic data in a unified probabilistic system, constitutes an advance in both scale and depth. We have demonstrated many exciting features of our approach. Particularly, using the established diagnosis database, we constructed a phenome map, showing a global relationship landscape of disease phenotypes.

Other studies

As a statistician, I collaborated with several experimental biologists and scientists on the analysis of their laboratory experiments.

·    With the Sohn group in mechanical engineering at UC Berkeley, we analyzed the data generated from a newly designed device for separating different types of cells (Lab Chip (2008)17). I serve as a co-PI on a three-year NSF grant that supports the continuation of this study and the collaboration.

·    With Dr. Wang at Yale University, we investigated the association between fractional anisotropy in anterior cingulum and single nucleotide polymorphisms (J. of Psychiatry & Neuroscience (2009)18).

·    With the Feldman group in plant microbial biology at UC Berkeley, we investigated the relationship between two transcriptionally distinct stem cell populations in maize. This work has been submitted.

Software packages we developed

·    GEA (Gene Expression Analyzer; http://cell.rutgers.edu/gea/): A tool for clustering and significance analysis of SAGE and microarray gene expression data6,11.

·    Knorm (http://cran.r-project.org/web/packages/knorm): An appealing statistical method for gene association inference across multiple dependent experimental conditions15.

·    LMM (available upon request): A method for predicting transcription factor binding sites by evaluating a candidate site in a local genomic context3.

·    TilingAnalyzer (available upon request): A method for analyzing multiple RNA tiling arrays.

These packages have received a lot of attention; e.g., GEA has over 100 users across the world.

 

All the above work has been based on close collaborations with five research groups: the Bickel group at UC Berkeley, the Zhou lab at USC, the Feldman lab at UC Berkeley, the Cai lab at Rutgers University, and the Sohn lab at UC Berkeley. The previous successful collaborative experiences strengthen my confidence in future collaborations.

 

 

References

1.      Huang H (2002). Error bounds on multivariate normal approximations for word count statistics. Advances in Applied Probability, 34(3): 559-586.

2.      Lippert RA, Huang H, Waterman MS (2002). Distributional regimes for the number of k-word matches between two random sequences. Proc Natl Acad Sci USA, 99(22):13980-9.

3.      Huang H, Kao MJ, Zhou XJ, Liu JS, Wong WH (2004). Determination of local statistical significance of patterns in Markov sequences with application to promoter element identification. Journal of Computational Biol, 11(1):1-14.

4.      Zhao X, Huang H, Speed T (2005). Finding short DNA motifs using permuted Markov models. Journal of Computational Biol, 12(6):894-906.

5.      ENCODE Consortium (2007).  Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 447, 799-816.

6.      Cai L*, Huang H*, Blackshaw S, Liu JS, Cepko CL, Wong WH (2004). Clustering analysis of SAGE data using a Poisson approach. Genome Biology, 5(7):R51.

*co-first authors

7.      Blackshaw S et al. (2004). Genomic analysis of mouse retinal development. PLoS Biol, 2(9):E247.

8.      Allinen M et al. (2004). Molecular characterization of the tumor microenvironment in breast cancer. Cancer Cell, 6(1):17-32.

9.      Huang H, Cai L, Wong WH (2008). Clustering analysis of SAGE transcription profiles using a Poisson approach. Methods Mol Biol, 387:185-98.

10.    Huang H, Kim K (2006). Unsupervised clustering analysis of gene expression. Chance, vol. 19, No.3.

11.    Kim K, Zhang S, Jiang K, Cai L, Lee IB, Feldman LJ, Huang H* (2007). An Efficient Measure of Similarity between Gene Expression Profiles through Data Transformations. BMC Bioinformatics, 8:29.

*corresponding author

12.    Jiang K, Zhang S, Lee S, Tsai G, Kim K, Huang H, Zhu T, Feldman LJ (2006). Transcription Profile Analyses Identify Genes and Pathways Central to Root Cap Functions in Maize. Plant Molecular Biology, 60(3):343-63.

13.    Zhou XJ, Kao MJ, Huang H, Wong A, Nunez-Iglesias J, Aparicio O, Morgan T, Wong WH (2005). Functional annotation and network reconstruction through cross-platform integration of microarray data. Nature Biotech, 23(2):238-43.

14.    Huang Y, Li H, Hu H, Yan X, Waterman MS, Huang H, Zhou XJ (2007). Systematic Discovery of Functional Modules and Context-Specific Functional Annotation of Human Genome. Bioinformatics, 23(13):i222-i229.

15.    Teng S, Huang H* (2009). A Statistical Framework to Infer Functional Gene Associations from Multiple Biologically Interrelated Microarray Experiments. JASA, June 2009, Vol. 104, No. 486.

*corresponding author 

16.    Liu C, Hu J, Kalakrishnan M, Huang H*, Zhou XJ* (2009) Integrative Disease Classification Based on Cross-platform Microarray Data. BMC Bioinformatics, 2009 Jan;10 Suppl 1:S25.

*co-corresponding author.

17.    Carbonaro A et al. (2008). Cell Characterization Using A Protein-Functionalized Pore, Lab Chip, 8(9):1478-85.

18.    Wang F et al. (2009). Neuregulin 1 Genetic Variation and anterior cingulum integrity in patients with schizophrenia and healthy controls. Journal of Psychiatry & Neuroscience, 2009 May;34(3):181-6.