Coherent Stochastic Models for Macroevolution

This is joint work with Lea Popovic and Maxim Krikun.

Brief motivation

There is a substantial literature comparing data on different aspects of macroevolution -- the evolutionary history of speciations and extinctions -- with the predictions of simple "pure chance" stochastic models. Available data include:

the distribution of number of species per genus
shapes of phylogenetic trees on extant species
fossil time series -- fluctuations in number of taxa over time

The fit of simple models, and of more elaborate models incorporating conjectured biological processes, has been studied in these contexts. While data-motivated models are scientifically natural, a mathematical aesthetic suggests a somewhat different approach: start with a "pure chance" model which simultaneously encompasses all the kinds of data that one might hope to find. Here are two instances of what one would like such a coherent model to provide.

Joint description of the phylogenetic tree on an extant clade of species, its extension to the tree on an observed small proportion of extinct species, and the (unobserved) entire tree on all extinct species.
Joint description of fossil time series at different levels of the taxonomic hierarchy.

(We emphasize the latter because the biological literature tends to assume that a model can be applied at any level, without enquiring whether this assumption is logically self-consistent.)

Outline of model

Our purpose is to present what is arguably the mathematically fundamental such model. The underlying model is simple -- a critical branching process conditioned to have $n$ lineages at the present time. Though hardly new in concept, our focus on conditioning to have $n$ lineages (for comparison with real clades on $n$ extant taxa) makes our results somewhat new in detail. To model higher-order taxa (genera, say) we start by assuming that each new species has some chance to be sufficiently different that it should be considered a new genus. The remaining details of defining genera, bearing in mind one desires monophyletic genera, can be handled in several different ways (see draft paper for details). Part of the project is to examine whether these different schemes for defining genera make a qualitative difference.
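
As a rough illustration of the underlying process, the following Python sketch simulates a critical birth-death process (equal speciation and extinction rates) started from a single species, and uses rejection sampling to keep only runs with exactly $n$ extant species. It is a minimal sketch under stated assumptions, not the construction used in the draft paper: the function names `simulate_clade` and `sample_conditioned`, the fixed clade age `t_max`, and the unit `rate` are all hypothetical choices made here for illustration, and the paper's treatment of the time of origin may differ.

```python
import random


def simulate_clade(t_max, rng, rate=1.0):
    """Critical linear birth-death process started from one species.

    Assumption for illustration: each living species speciates at `rate`
    and goes extinct at the same `rate` (criticality), so the next event
    arrives after an exponential waiting time with total rate
    2 * rate * alive, and is a speciation or an extinction with equal
    probability.

    Returns (alive, total): the number of species alive at time t_max and
    the number of species that ever existed up to that time.
    """
    t, alive, total = 0.0, 1, 1
    while alive > 0:
        t += rng.expovariate(2.0 * rate * alive)
        if t >= t_max:
            break                 # next event falls after the present time
        if rng.random() < 0.5:    # speciation: one new species appears
            alive += 1
            total += 1
        else:                     # extinction: one species disappears
            alive -= 1
    return alive, total


def sample_conditioned(n, t_max, rng, max_tries=1_000_000):
    """Rejection sampling: return the first run with exactly n extant species.

    Fixing the clade age t_max is a simplifying assumption of this sketch.
    """
    for _ in range(max_tries):
        alive, total = simulate_clade(t_max, rng)
        if alive == n:
            return alive, total
    raise RuntimeError("no accepted sample; try a different t_max")


if __name__ == "__main__":
    rng = random.Random(0)
    alive, total = sample_conditioned(n=5, t_max=5.0, rng=rng)
    print("extant:", alive, "extinct:", total - alive)
```

The difference `total - alive` is the number of extinct species in the accepted run. Rejection sampling is simple but becomes slow for large $n$, since most runs either die out or end with the wrong number of species.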

Overview of results

Our results are derived as asymptotics for large $n$, even though we envisage using them for rather small values, say $n = 20$.

We draw attention to some basic scaling results:
- the $n^2$ law: that in a clade of $n$ extant species one expects order $n^2$ extinct species;
- the $n$ law: that the time since clade origin, or since the last common ancestor, is order $n$ times the mean species lifetime;
- the $1/r$ law: that with probability $1/r$ there was some past time at which the number of species was at least $r$ times the present number;
- the $1/n$ law: that the probability that a given extinct species is an ancestor of some extant species is order $1/n$;
- and the constant law: that the probability that a given extant species is a descendant of some other extant species has a non-zero limit as $n \to \infty$.
A "local" description of the probability structure of large clades, which permits easy calculations.
A "loss of evolutionary history under random extinctions" calculation within our model.
The joint distribution of the time back to the origin of the clade, the time back to the last common ancestor, and the number of species at that time.
The shape of phylogenetic trees on higher taxa becomes more unbalanced.
We compare typical fluctuation rates of taxon counts at different levels of the hierarchy.
Our model has more intrinsic variability than previous models, and therefore provides a more conservative approach to inferring biological mechanism (rather than "just chance") from evolutionary history.