My research interests have shifted in recent years from mathematical problems at the interface of stochastic processes and to random combinatorial structures, to more practical problems of academic information management. A bridge between these interests is provided by probabilistic models for random partitions and related stuctures such as random graphs which may serve as the basis of machine-learning algorithms for automated classification and entity recognition in the bibliographic universe. The importance of connecting the combinatorics of large graphs to methods of organizing the world's information is amply demonstrated by the development and industrial application of the page rank algorithm by Google. Much remains to be discovered about how best to structure and manage the deluge of academic information now available to researchers. I expect that ideas from combinatorics, probability and machine learning will continue to find important applications in this endeavor.

Combinatorial Stochastic Processes

I have long been interested in interfaces between the traditional theory of stochastic processes and other areas of mathematics, especially combinatorics. I have studied various random combinatorial objects, such as permutations, partitions, and trees, and how the asymptotic behaviour of such structures over a large number of elements can be described in probabilistic terms, most often involving Brownian motion and related processes. This has led to the study of various measure-valued and partition-valued Markov processes whose behaviour may be understood in terms of combinatorial constructions involving random trees. I have been engaged in developing various ideas related to random partitions, random trees, irreversible processes of coalescence, and their time reversals which provide models for random splitting or fragmentation. I view this line of research largely as pure mathematics, but mathematics of a concrete kind which is often motivated and influenced by applications. Stochastic models with a natural probabilistic structure typically turn up in different disguises in diverse fields. The study of their mathematical structure allows ideas and results developed in one context to be transferred to another. Following is a selection of articles in this vein over the last 4 years.

Bénédicte Haas, Grégory Miermont, Jim Pitman and Matthias Winkel

** Continuum tree asymptotics of discrete fragmentations and applications to phylogenetic models **

*Ann. Probab.* Vol. 36 No. 5, 1790-1837 (2008). [arXiv] [DOI] [MR?] [GS?]

Given any regularly varying dislocation measure, we identify a natural self-similar fragmentation tree as scaling limit of discrete fragmentation trees with unit edge lengths. As an application we obtain continuum random tree limits of Aldous's beta-splitting models and Ford's alpha models for phylogenetic trees. This confirms in a strong way that the whole trees grow at the same speed as the mean height of a randomly chosen leaf.

Bénédicte Haas, Jim Pitman and Matthias Winkel

** Spinal partitions and invariance under re-rooting of continuum random trees **

*Ann. Probab.* Vol. 36 No. 5, 1790-1837 (2008). [arXiv] [DOI] [MR?] [GS?]

We develop some theory of spinal decompositions of discrete and continuous fragmentation trees. Specifically, we consider a coarse and a fine spinal integer partition derived from spinal tree decompositions. We prove that for a two-parameter Poisson-Dirichlet family of continuous fragmentation trees, including the stable trees of Duquesne and Le Gall, the fine partition is obtained from the coarse one by shattering each of its parts independently, according to the same law. As a second application of spinal decompositions, we prove that among the continuous fragmentation trees, stable trees are the only ones whose distribution is invariant under uniform re-rooting.

Jim Pitman and Matthias Winkel

** Regenerative tree growth: binary self-similar continuum random trees and Poisson-Dirichlet compositions **

*Ann. Probab.* Vol. 37 No. 5, 1999-2041 (2009). [pdf] [Project Euclid] [arXiv] [DOI] [GS?]

We use a natural ordered extension of the Chinese Restaurant Process to grow a two-parameter family of binary self-similar continuum fragmentation trees. We provide an explicit embedding of Ford's sequence of alpha model trees in the continuum tree which we identified in a previous article as a distributional scaling limit of Ford's trees. In general, the Markov branching trees induced by the two-parameter growth rule are not sampling consistent, so the existence of compact limiting trees cannot be deduced from previous work on the sampling consistent case. We develop here a new approach to establish such limits, based on regenerative interval partitions and the urn-model description of sampling from Dirichlet random distributions.

Peter McCullagh, Jim Pitman and Matthias Winkel

** Gibbs fragmentation trees **

*Bernoulli* Vol. 14 No. 4, 988-1002 (2008). [arXiv] [DOI] [MR?] [GS?]

We study fragmentation trees of Gibbs type. In the binary case, we identify the most general Gibbs type fragmentation tree with Aldous's beta-splitting model, which has an extended parameter range β>-2 with respect to the Beta(β+1,β+1) probability distributions on which it is based. In the multifurcating case, we show that Gibbs fragmentation trees are associated with the two-parameter Poisson-Dirichlet models for exchangeable random partitions of bN, with an extended parameter range.

Jomy Alappattu and Jim Pitman

** Coloured loop-erased random walk on the complete graph **

*Combinatorics, Probability and Computing* Vol. 17, 727-740 (2008). [.pdf] [arXiv] [MR?] [GS?]

Starting from a sequence regarded as a walk through some set of values, we consider the associated loop-erased walk as a sequence of directed edges, with an edge from i to j if the loop erased walk makes a step from i to j. We introduce a coloring of these edges by painting edges with a fixed color as long as the walk does not loop back on itself, then switching to a new color whenever a loop is erased, with each new color distinct from all previous colors. The pattern of colors along the edges of the loop-erased walk then displays stretches of consecutive steps of the walk left untouched by the loop-erasure process. This paper describes explicitly the distribution of the sequence of random lengths of differently colored segments of the loop-erased walk, and yields asymptotic descriptions of these random lengths as N tends to infinity.

Berestycki, Nathanaël and Pitman, Jim

** Gibbs distributions for random partitions generated by a fragmentation process **

*J. Stat. Phys.* Vol. 127 No. 2, 381--418 (2007). [arXiv] [DOI] [MR] [GS?]

In this paper we study random partitions of 1,...n, where every cluster of size j can be in any of w_j possible internal states. The Gibbs (n,k,w) distribution is obtained by sampling uniformly among such partitions with k clusters. We provide conditions on the weight sequence w allowing construction of a partition valued random process where at step k the state has the Gibbs (n,k,w) distribution, so the partition is subject to irreversible fragmentation as time evolves. For a particular one-parameter family of weight sequences w_j, the time-reversed process is the discrete Marcus-Lushnikov coalescent process with affine collision rate K_i,j=a+b(i+j) for some real numbers a and b. Under further restrictions on a and b, the fragmentation process can be realized by conditioning a Galton-Watson tree with suitable offspring distribution to have n nodes, and cutting the edges of this tree by random sampling of edges without replacement, to partition the tree into a collection of subtrees. Suitable offspring distributions include the binomial, negative binomial and Poisson distributions.

**Bibliographic Knowledge Network**

In 2008 I organized submission of a collaborative proposal to the NSF Cyber-enabled Discovery and Innovation (CDI) Program to develop a suite of tools and services to encourage formation of virtual organizations in scientific communities of various sizes, such as conference groups and departmental research groups, and allow such organizations to filter out relevant documents from various input streams, select and enhance the quality of bibliographic data associated with the organization, and attract students, teachers and researchers to contribute to activity of the organization. The main partners are the American Institute of Mathematics, a group at the Institute of Quantitative Science at Harvard University, and a group in Berkeley Statistics. Some further individuals involved are Hadley Wickham at Rice and Jon Willinsky at Stanford.

This project was funded in September 2008, and work has been proceeding since then. See the Project Website for further details.

The project addresses three fundamental problems of knowledge management:

- The
**compartmentalization problem**(how to break down barriers which separate disciplines) - The
**maintenance problem**(how to provide incentives for individuals and organizations to improve the quality of publicly accessible knowledge). - The
**navigation problem**(how to guide students and researchers within and between disciplines)

The idea is to solve these problems by gradually distilling the wealth of heterogeneous data now available in digital formats into an openly navigable network of websites, the Bibliographic Knowledge Network, each node of which is a website dedicated to a specific topic or field of knowledge. Each participating site will be maintained by some individual or Virtual Organization with a commitment to that field. Sites may be designed as guides for researchers, teachers, and students, or they may provide more specialized services, such as gateways to connect other internet resources.

Accomplishments of the project todate have been convergence on the specification of BibJSON as a bibliographic data format, success with the use of CouchDB as a bibliographic database compatible with the project needs, and development of a preliminary system of bibliographic web services. I have also been working with Berkeley students Jeff Regier, James Long and Fei Yu on various aspects of machine-aided disambiguation and matching of various bibliographic entities, especially authors and subjects.