Targeted Biomarker Discovery

class: center, middle, inverse, title-slide

# Targeted Biomarker Discovery
## Biostatistics Admit Day
### <a href="http://nimahejazi.org">Nima Hejazi</a>
### 20 March 2017

---

# Accessing these slides

### View online:
[goo.gl/HZzosu](https://goo.gl/HZzosu)

### Via git:

```bash
git clone https://github.com/nhejazi/talk_admitday.git
```

???

- This talk will focus on challenges in analyzing high-dimensional biological
data, with a brief introduction to common problems.

- We will look at how standard approaches from statistical causal inference
(esp., _Targeted Learning_) can be extended to this class of data.

---
class: inverse, center, middle

# Statistics in High-Dimensional Biology

---

# Biological Sequencing I

## Why?

- We can probe biological and health processes at a very fine level.

- Learn more about the molecular basis of health and disease.

- Querying different genomic processes:
  1. DNA/RNA expression
  2. Protein expression
  3. Epigenetics (e.g., DNA methylation)

???

- We are involved in research with health applications and the creation of
biological domain knowledge.

- Of the list given, the top 2 together comprise the "central dogma of molecular
biology."

- There's a fascinating array of processes to be studied and much room for
statistical innovation.

---

# Biological Sequencing II

## Who?

- Experimental scientists in
  1. environmental epidemiology
  2. molecular biology
  3. bioengineering
  4. neuroscience

???

- From the diversity of research areas, you're sure to find something
interesting to work on.

- Just a list I came up with off the top of my head...before my morning coffee.

---

# Biological Sequencing III

## How?

- (some) Popular biotechnology:
  1. Microarrays (genes, CpG sites, etc.)
  2. RNA-Seq
  3. Single-Cell RNA-Seq

???

- There are numerous challenges in analyzing such data sets.

- High-dimensional? personally, `$n \in (4, 125)$`, `$g \in (10000, 850000)$`.

---

# Biological Sequencing Data

Conventions differ in genomics:

`$\begin{bmatrix} x_{11} & x_{12} & x_{13} & \dots & x_{1g} \\ x_{21} & x_{22} & x_{23} & \dots & x_{2g} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & x_{n3} & \dots & x_{ng} \end{bmatrix} \xrightarrow[]{\text{transpose}} \begin{bmatrix} x_{11} & x_{12} & x_{13} & \dots & x_{1n} \\ x_{21} & x_{22} & x_{23} & \dots & x_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{g1} & x_{g2} & x_{g3} & \dots & x_{gn} \end{bmatrix}$`

__n.b.__, `$n << g$`

- standard practice: _subjects in rows, variables in columns_

- genomics: _genes in rows, subjects in columns_

- Why this awful convention?

- Thanks, Microsoft (Excel)...the root of all evil (in data science).

- "Doing data science with spreadsheets is like drunk driving."

???

- This threw me off when first starting to work in this area.

- We'd like to make claims about the genes, so why are they in the rows? Excel.

- Quote from podcast ("Not so standard deviations"), originally P.B. Stark.

---

# Targeted Learning

- A framework for causal inference and variable importance analysis.

- Let's represent observed data as `$O = (W, A, Y)$`

- `$W$`: baseline variables (e.g., age, sex, SES)

- `$A$`: exposure/treatment (e.g., benzene)

- `$Y$`: outcome of interest (e.g., gene expression)

- __goal:__ Estimate the effect of a treatment (A) on an outcome (Y) while
controlling for baseline covariates (W)

- Target parameters like the "average treatment effect" (ATE):

`$$\Psi(P_n^*) = E[E[Y \mid A = 1, W] - E[Y \mid A = 0, W]]$$`

???

- All of this is covered in a series of 1st-year courses.

- Significant area of research in our department (M. van der Laan, A. Hubbard,
M. Petersen).

---

# R package: `biotmle`

- What does this package do?

_Biomarker identification, combining targeted learning with moderated (empirical
Bayes) statistics to obtain conservative, robust estimates, with inference._

- Moderated statistics and targeted learning:

`$$\tilde{t}_b = \frac{\sqrt{n}(\Psi_b(P_n^*) - \Psi_b(P_0))}{\tilde{S}_b}$$`
--

```r
library(devtools)
devtools::install_github("nhejazi/biotmle")
```

```r
library(biotmle)
```

```
## biotmle: Targeted Learning for Biomarker Discovery
```

```
## Version: 0.99.3
```

```r
data(illuminaData)
```

???

- A working implementation of a targeted learning approach to biomarker
discovery using moderated statistics.

- Next, we'll walk through analyzing some data.

---
class: inverse, middle, center

# Data Analysis with `biotmle`

---

# Baseline covariates (W)

```r
# W - age, sex, smoking
W <- illuminaData %>%
  dplyr::select(which(colnames(.) %in% c("age", "sex", "smoking"))) %>%
  dplyr::mutate(
    age = as.numeric((age > quantile(age, 0.25))),
    sex = I(sex),
    smoking = I(smoking)
  )
```

```r
head(W)
```

```
##   age sex smoking
## 1   1   1       1
## 2   1   1       1
## 3   0   1       2
## 4   1   2       2
## 5   1   2       2
## 6   1   2       1
```

???

- Our department revolves around using R. You're more than welcome to use
whatever programming language you'd like, but you'll have to learn R if you want
to communicate and collaborate.

- Probably wouldn't hurt to start learning now, if you don't know it already.

---

# Exposure of interest (A)

```r
# A - benzene exposure (discretized)
A <- illuminaData %>%
  dplyr::select(which(colnames(.) %in% c("benzene")))
A <- A[, 1]
```

```r
unique(A)
```

```
## [1] 1 3 2
```

```r
table(A)
```

```
## A
##  1  2  3 
## 42 59 24
```

???

- We discretize the exposure/treatment to make it fit the form of the parameter
that we saw before (the ATE).

- Decent distribution of observations across the levels of A (though there are
fewer individuals in the highest exposure level).

---

# Outcome of interest (Y)

```r
# Y - genes
Y <- illuminaData %>%
  dplyr::select(which(colnames(.) %ni% c("age", "sex", "smoking", "benzene",
                                         "id")))
geneIDs <- colnames(Y)
```

```r
dim(Y)
```

```
## [1]   125 22177
```

```r
head(Y[, 1:7])
```

```
##    6960451  2600731  2120309  7510608  1570494  6520451  5960017
## 1 450.6910 2778.857 119.8120 203.8761 135.3883 222.8536 200.8315
## 2 339.4663 2856.571 113.6889 228.0108 126.4207 219.0222 185.6719
## 3 481.7867 4252.924 113.0603 184.6628 165.4673 215.8639 190.5513
## 4 284.1533 1477.202 101.9724 199.9513 110.4363 172.2799 151.1723
## 5 334.6466 1316.800 114.9128 204.2139 127.4630 210.0835 194.2244
## 6 415.4404 3646.593 125.2571 220.9299 145.5556 222.3194 179.5192
```

???

- Woah, look at that dimensionality!

- Expression measures (from microarrays) appear arbitrary.

---

# Identifying biomarkers

We can use the package to identify potential biomarkers:

```r
biomarkerTMLEout <- biomarkertmle(Y = Y, # biomarkers
                                  W = W, # baseline covariates
                                  A = A, # exposure (benzene)
                                  type = "exposure",
                                  parallel = TRUE,
                                  family = "gaussian",
                                  g_lib = c("SL.glmnet", "SL.randomForest",
                                            "SL.polymars", "SL.mean"),
                                  Q_lib = c("SL.glmnet", "SL.randomForest",
                                            "SL.nnet", "SL.mean")
                                 )
```

```r
design <- as.data.frame(cbind(rep(1, nrow(Y)),
                              as.numeric(A == max(unique(A)))))
colnames(design) <- c("intercept", "Tx")
limmaTMLEout <- limmatmle(biotmle = biomarkerTMLEout, IDs = NULL,
                          designMat = design)
```

???

- The procedure is quite resource-intensive as it evaluates the association of
each potential biomarker (over `$20,000$`) with the exposure of interest A

- ...while accounting for potential confounding based on the covariates included
in W

---
class: inverse, middle, center

# Visualizing results (from `biotmle`)

---

# Unadjusted results from tests

???

- still find a large number of biomarkers

---

# Adjusted results from tests

???

- multiple testing correction (FDR control)

- account for simultaneous tests

---

# Visualization I: Volcano Plot

???

- volcano plots are pretty standard

- our goal was to reduce the number of significant findings at a low fold change
in the parameter of interest. It appears that we were successful.

---

# Visualization II: Heatmap

---
class: center, middle

# Thanks!

Slides created via the R package
[**xaringan**](https://github.com/yihui/xaringan).

Powered by [remark.js](https://remarkjs.com),
[**knitr**](http://yihui.name/knitr), and
[R Markdown](https://rmarkdown.rstudio.com).

---
class: center, middle

# Me

slides: [goo.gl/HZzosu](https://goo.gl/HZzosu)

[nimahejazi.org](http://nimahejazi.org)

[stat.berkeley.edu/~nhejazi](https://www.stat.berkeley.edu/~nhejazi)

_email:_ nhejazi -AT- berkeley -DOT- edu

_twitter:_ [@nshejazi](https://twitter.com/nshejazi)

_GitHub:_ [nhejazi](https://github.com/nhejazi)