My
primary research interests lie in developing novel statistical approaches in
bioinformatics. Specific topics of
current interest are: 1) prediction of functional elements in genome sequences;
2) identification of gene functional relationships, pathways and networks; and
3) translational bioinformatics:
analysis of multiple-microarrays and integrative information retrieval for
aiding disease diagnosis.
Predicting functional elements in genome sequences
· We published theoretical studies on the asymptotic distributions of over- or under-represented words and related statistics in DNA sequences in Advances in Applied Probability (2002)1 and Proceedings of the National Academy of Sciences (2002)2. This work contributes to the mathematical foundations of many downstream sequence analyses, e.g., sequence comparison.
· We published two studies on modeling and
identifying fuzzy words (e.g., protein binding motifs) in Journal of
Computational Biology (2004, 2005)3,4. The developed algorithms, based
on solid probabilistic principles, received wide-spread attention.
· In cooperation with the ENCODE Consortium (an international collaboration to annotate all functional elements in the human genome), I helped develop a
statistical method for studying the joint distribution of genomic
features. Application of the
approach to ENCODE data led
to several key biological observations. The results were included in the Nature
(2007)5 paper. The manuscript on the statistical
aspects of the method is
under review for Annals of Applied
Statistics.
(Method details) Our method is based on a novel “segmented block-bootstrap”, as well as analytical derivations of variances in discrete stochastic processes. It enables the detection of relationships between various
genomic features. Prior to our work, no statistically rigorous approach to this analysis
existed in the literature. This approach has been adopted by a
number of groups. We
expect it will be widely used in most areas of
genomic studies in the near future.
· We developed a simple
yet effective approach for genome-wide transcriptome identification by multiple RNA tiling arrays. The
approach is novel and biologically insightful.
(Method
details) Our generative model carefully elaborates the sources of randomness in multiple tiling arrays. Based
on the model, we derived two statistics, monotonic to the model parameters of
interest, to identify the transcribed regions. The use
of these statistics nicely bypasses the difficulty
in fitting the model, e.g.,
estimating the probe affinity. We demonstrated the effectiveness of
our method using real data. Software packages for Windows and Unix operating systems are available.
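To illustrate the resampling idea behind the ENCODE method described above, here is a minimal Python sketch of an ordinary circular block bootstrap for the overlap between two binary feature tracks. It is not the segmented block-bootstrap itself: the segmentation step is omitted, and the function names, the overlap statistic, and the toy tracks are assumptions made only for this example.

```python
import random
import statistics


def overlap_fraction(a, b):
    """Fraction of positions covered by both binary feature tracks."""
    return sum(x and y for x, y in zip(a, b)) / len(a)


def block_bootstrap_overlap(a, b, block_len, n_boot, seed=0):
    """Bootstrap replicates of the overlap statistic (hypothetical helper).

    Whole blocks of consecutive positions are resampled with wrap-around,
    and the same positions are drawn from both tracks, so short-range
    dependence along the sequence is kept intact inside each block.
    """
    rng = random.Random(seed)
    n = len(a)
    n_blocks = -(-n // block_len)  # ceiling of n / block_len
    replicates = []
    for _rep in range(n_boot):
        idx = []
        for _blk in range(n_blocks):
            start = rng.randrange(n)
            idx.extend((start + k) % n for k in range(block_len))
        idx = idx[:n]  # trim so every replicate has the original length
        replicates.append(
            overlap_fraction([a[i] for i in idx], [b[i] for i in idx])
        )
    return replicates


# Toy tracks (assumed data): feature A covers positions 0-49, feature B 30-79.
track_a = [1] * 50 + [0] * 150
track_b = [0] * 30 + [1] * 50 + [0] * 120

observed = overlap_fraction(track_a, track_b)
replicates = block_bootstrap_overlap(track_a, track_b, block_len=20, n_boot=200)
std_error = statistics.stdev(replicates)
print(f"observed overlap = {observed:.3f}, bootstrap SE = {std_error:.3f}")
```

Resampling individual positions instead of blocks would ignore the clustering of genomic features along the sequence and would tend to understate the variance of the statistic.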
Identifying gene functional relationships, pathways and networks
· We published in Genome Biology (2004)6 the first model-based method for grouping functionally related genes using expression data from multiple SAGE libraries. This method has been widely applied and has led to many important biological and clinical findings.
The Cepko Lab at
Since
publication, our approach has attracted attention and been referenced by researchers around the
world. In particular, I
was invited to
contribute a book chapter on it in Methods
in Molecular Biology (2008)9, in addition to an invited article in Chance (2006)10. We later extended and improved the approach by adapting the same model to an appropriate feature space (BMC Bioinformatics, 2007)11. This
work was marked as highly accessed two weeks after its publication. A successful application of the method to
a Maize microarray dataset resulted in a joint publication with the Feldman lab
in Plant Molecular Biology (2006)12. An integrated software package
for both methods, named GEA, is
available.
· I contributed to two publications, in Nature Biotechnology (2005)13 and Bioinformatics (2007)14, on inferring gene relationships from cross-platform microarray data. The Nature Biotechnology paper (highlighted in Nature Reviews Genetics) introduces the novel concept of second-order co-expression, which can be used to identify genes that share a function but lack direct co-expression patterns, and to reconstruct regulatory networks.
· We published in Journal of the American Statistical Association (2009)15 a novel framework for estimating gene correlations while controlling for experiment dependencies. This study provides a conceptual advance in the analysis of microarray gene expression data.
(Motivation
of the study) Microarray
data from an increasing number of biologically interrelated experiments now allows for more complete portrayals
of functional gene relationships. In current studies of gene relationships, however, the expression dependencies attributable to the biologically interrelated experiments are widely ignored. When unaccounted for, these experiment dependencies can result in inaccurate inferences of functional gene relationships, and hence incorrect biological conclusions.
(Method
details) We developed a framework, consisting of a model and an estimation
procedure, to infer gene relationships when
there are two-way dependencies in the gene expression matrix (the gene-wise and
experiment-wise dependencies). The central idea of the framework is to use a Kronecker product covariance
matrix to model the gene-experiment interactions. The resulting
novel gene co-expression measure, named Knorm correlation, has a smaller estimation variance than the widely used Pearson coefficient. The implementation of the iterative
estimation procedure requires some monitoring and control of the quality of the
two estimated correlation matrices and their inverses. A
follow-up study that extends the model to find pathway genes is ongoing.
(Significance
of the work) It is
the first methodological study for concurrently estimating the gene-wise and experiment-wise
dependencies from an expression matrix. The difficulty in constructing
and implementing the model comes from the high-dimensional, complex
nature of gene expression data: the number of genes
is far larger than the number of experiments; only
replicates of “expression vectors” (an expression vector
corresponds to an array or a column in an expression matrix) are available; and
we do not really observe the replicates of “whole expression
matrices” (this last point, in particular, has been widely ignored or
misunderstood). Our method, with appealing intuitive explanations, is
biologically insightful.
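The two-way dependence structure described above is commonly written as a matrix-variate normal model. The following is a sketch under that assumption; the symbols X, M, \Sigma_G and \Sigma_E are introduced here for illustration and are not necessarily the notation of the paper.

```latex
% X: p x n expression matrix (p genes, n experiments); M: its mean matrix.
% \Sigma_G (p x p): gene-wise covariance; \Sigma_E (n x n): experiment-wise covariance.
% vec(.) stacks the columns (experiments) of a matrix.
\operatorname{vec}(X) \sim \mathcal{N}\bigl(\operatorname{vec}(M),\ \Sigma_E \otimes \Sigma_G\bigr),
\qquad
\operatorname{Cov}(X_{gi}, X_{hj}) = (\Sigma_G)_{gh}\,(\Sigma_E)_{ij}.

% If \Sigma_E were known, the experiment-wise dependence could be removed:
\tilde{X} = (X - M)\,\Sigma_E^{-1/2},
\qquad
\operatorname{Cov}\bigl(\operatorname{vec}(\tilde{X})\bigr)
  = \bigl(\Sigma_E^{-1/2} \otimes I_p\bigr)\bigl(\Sigma_E \otimes \Sigma_G\bigr)\bigl(\Sigma_E^{-1/2} \otimes I_p\bigr)
  = I_n \otimes \Sigma_G.
```

Under this formulation the columns of \tilde{X} are uncorrelated with common covariance \Sigma_G, so a gene-gene correlation computed from them is not distorted by experiment dependence. In practice both covariance matrices are unknown (and identified only up to a scalar factor), which is why they have to be estimated jointly and iteratively.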
Translational Bioinformatics research
This new area of inquiry concerns the analysis of, and reasoning about, the enormous quantity of life science data in public repositories. Below we summarize two of our projects in this direction.
(Motivation of our study) The
rapid accumulation of microarray gene expression data has offered unprecedented opportunities to study
human diseases. The NCBI
Gene Expression Omnibus (GEO) repository is currently the largest database that
systematically documents the genome-wide molecular basis of diseases. However, to date, this
resource has been far from fully utilized. It could serve as a rich source of information for
disease diagnosis: screening across the enormous number of disease expression datasets holds the
promise of narrowing down disease candidates in an automated fashion. Such
expression-based
automated diagnosis would be particularly useful when the potential disease is
not obvious
or when the disease lacks biochemical diagnostic tests.
We aim to take the first steps toward turning the NCBI GEO repository into an
automated disease diagnosis system. Our study provides an important
application for the massive public microarray
data - potentially a quantum leap ahead of most current diagnosis approaches based on
qualitative information.
· In the BMC Bioinformatics (2009)16 paper, we tested the feasibility of disease classification using the large number of heterogeneous microarray datasets from NCBI GEO.
(Study details) In this study, we
overcame several
challenges: 1) To
remove the cross-platform data incompatibilities, we derived standardized profiles (vectors) whose
components reflect
the level and direction of differential expression of disease-related genes. Differential expression is
intrinsic to the disease and hence carries the most stable
information regardless of platform or lab differences. About 9000 microarray
experiments were included in our study. 2) We mapped the heterogeneous phenotypic text information to
concepts in the Unified
Medical Language System.
This enables us to categorize the thousands of microarray datasets
into different disease
classes. 3) We designed a classification
approach named ManiSVM. It integrates Manifold data
transformation with
SVM learning. Real data
analysis showed that ManiSVM is advantageous. The manifold data transformation is
critical for effective learning, since the data are very noisy and
heterogeneous.
· In a manuscript presently in preparation, we report the first study to transform the NCBI GEO repository into an automated disease diagnosis database.
(Study details) We developed an
approach to robustly
diagnose a query expression profile by jointly utilizing the quantitative genomic data and the phenotypic text data. We
formulated the question as a hierarchical multi-label classification problem. That is, we aim to
categorize a query expression profile into multiple relevant disease classes along a hierarchical disease
taxonomy. We developed a two-stage Bayesian learning approach for the problem.
The approach first
builds
independent Bayesian classifiers for each disease class, followed by the integration of individual
predictions with a Bayesian network model to allow collaborative
error-correction across classes in the hierarchy.
(Significance of the work) This disease diagnosis
problem is much more challenging than the many existing efforts in the literature, where the
disease-query problems were usually based only on the gene expression data or on the phenotypic
metadata. A sensitive integration of multiple data-types requires a full and careful consideration
of the complex data properties as well as the various sources of noise (remember that our study deals with
thousands of microarray experiments). Such modeling cannot be
achieved by a direct application of any existing prediction or machine learning approach without massive data
processing. Our Bayesian learning approach, which allows interrogating the genomic and phenotypic
data in a unified
probabilistic system, constitutes an advance in both scale and depth. We have demonstrated many exciting
features of our
approach. In particular, using the established diagnosis database, we constructed a phenome map that shows the global landscape of relationships among disease phenotypes.
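To make the hierarchical multi-label formulation concrete, here is a toy Python sketch. It does not reproduce the two-stage Bayesian approach: the Bayesian network integration is replaced by a much simpler top-down consistency rule (a class is kept only if it and all of its ancestors pass a threshold). The taxonomy, the scores, and the function name are hypothetical and chosen only for illustration.

```python
def hierarchical_labels(scores, parent, threshold=0.5):
    """Return the classes that pass the threshold together with all ancestors.

    scores: dict mapping class name -> probability-like score, as might come
            from independent per-class classifiers (assumed input).
    parent: dict mapping class name -> parent class (None for a root).
    An ancestor with no score is treated as failing the threshold.
    """
    def passes_along_path(cls):
        # Walk from the class up to the root of the taxonomy.
        while cls is not None:
            if scores.get(cls, 0.0) < threshold:
                return False
            cls = parent.get(cls)
        return True

    return {cls for cls in scores if passes_along_path(cls)}


# Hypothetical two-branch disease taxonomy.
taxonomy = {
    "neoplasm": None,
    "carcinoma": "neoplasm",
    "breast carcinoma": "carcinoma",
    "nervous system disease": None,
    "schizophrenia": "nervous system disease",
}

# Hypothetical per-class scores for one query expression profile.
query_scores = {
    "neoplasm": 0.9,
    "carcinoma": 0.8,
    "breast carcinoma": 0.7,
    "nervous system disease": 0.2,
    "schizophrenia": 0.6,
}

labels = hierarchical_labels(query_scores, taxonomy)
print(sorted(labels))
```

In this toy query, "schizophrenia" scores above the threshold on its own but is dropped because its parent class does not; this is the kind of cross-class inconsistency that an integration step over the hierarchy is meant to resolve.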
Other studies
As a statistician, I
collaborated with several experimental biologists and scientists on the analysis of their laboratory
experiments.
· With the Sohn group in mechanical
engineering at UC Berkeley, we analyzed the data generated from a newly
designed device for separating different types of cells (Lab Chip (2008)17).
I serve as a co-PI on a three-year NSF grant that supports the continuation of this
study and the collaboration.
· With Dr. Wang at
· With the Feldman group in plant
microbial biology at UC Berkeley, we investigated the relationship between two
transcriptionally distinct stem cell populations in Maize. This work has been
submitted.
Software packages we developed
· GEA (Gene Expression Analyzer; http://cell.rutgers.edu/gea/): A tool for clustering and significance analysis of SAGE and microarray gene expression data6,11.
· Knorm (http://cran.r-project.org/web/packages/knorm): A statistical method for gene association inference across multiple dependent experimental conditions15.
· LMM (available upon request): A method for predicting transcription factor binding sites by evaluating a candidate site in a local genomic context3.
· TilingAnalyzer (available upon request): A method for analyzing multiple RNA tiling arrays.
These packages have received considerable attention; e.g., GEA has more than 100 users across the
world.
All the above work has been based on close collaborations with five research groups:
the Bickel group at UC Berkeley, the Zhou lab at USC,
the Feldman lab at UC Berkeley,
the Cai lab at
References
1. Huang H (2002). Error bounds on multivariate normal approximations for word count statistics. Advances in Applied Probability, 34(3):559-586.
2. Lippert RA, Huang H, Waterman MS (2002). Distributional regimes for the number of k-word matches between two random sequences. Proc Natl Acad Sci USA, 99(22):13980-9.
3. Huang H, Kao MJ, Zhou XJ, Liu JS, Wong WH (2004). Determination of local statistical significance of patterns in Markov sequences with application to promoter element identification. Journal of Computational Biol, 11(1):1-14.
4. Zhao X, Huang H, Speed T (2005). Finding short DNA motifs using permuted Markov models. Journal of Computational Biol, 12(6):894-906.
5. ENCODE Consortium (2007). Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447:799-816.
6. Cai L*, Huang H*, Blackshaw S, Liu JS, Cepko CL, Wong WH (2004). Clustering analysis of SAGE data using a Poisson approach. Genome Biology, 5(7):R51. *co-first authors
7. Blackshaw S et al. (2004). Genomic analysis of mouse retinal development. PLoS Biol, 2(9):E247.
8. Allinen M et al. (2004). Molecular characterization of the tumor microenvironment in breast cancer. Cancer Cell, 6(1):17-32.
9. Huang H, Cai L, Wong WH (2008). Clustering analysis of SAGE transcription profiles using a Poisson approach. Methods Mol Biol, 387:185-98.
10. Huang H, Kim K (2006). Unsupervised clustering analysis of gene expression. Chance, Vol. 19, No. 3.
11. Kim K, Zhang S, Jiang K, Cai L, Lee IB, Feldman LJ, Huang H* (2007). An Efficient Measure of Similarity between Gene Expression Profiles through Data Transformations. BMC Bioinformatics, 8:29. *corresponding author
12. Jiang K, Zhang S, Lee S, Tsai G, Kim K, Huang H, Zhu T, Feldman LJ (2006). Transcription Profile Analyses Identify Genes and Pathways Central to Root Cap Functions in Maize. Plant Molecular Biology, 60(3):343-63.
13. Zhou XJ, Kao MJ, Huang H, Wong A, Nunez-Iglesias J, Aparicio O, Morgan T, Wong WH (2005). Functional annotation and network reconstruction through cross-platform integration of microarray data. Nature Biotech, 23(2):238-43.
14. Huang Y, Li H, Hu H, Yan X, Waterman MS, Huang H, Zhou XJ (2007). Systematic Discovery of Functional Modules and Context-Specific Functional Annotation of Human Genome. Bioinformatics, 23(13):i222-i229.
15. Teng S, Huang H* (2009). A Statistical Framework to Infer Functional Gene Associations from Multiple Biologically Interrelated Microarray Experiments. JASA, June 2009, Vol. 104, No. 486. *corresponding author
16. Liu C, Hu J, Kalakrishnan M, Huang H*, Zhou XJ* (2009). Integrative Disease Classification Based on Cross-platform Microarray Data. BMC Bioinformatics, 2009 Jan;10 Suppl 1:S25. *co-corresponding authors
17. Carbonaro A et al. (2008). Cell Characterization Using A Protein-Functionalized Pore. Lab Chip, 8(9):1478-85.
18. Wang F et al. (2009). Neuregulin 1 Genetic Variation and anterior cingulum integrity in patients with schizophrenia and healthy controls. Journal of Psychiatry & Neuroscience, 2009 May;34(3):181-6.