Danny Kaplan

For studying the effects of aging on athletic performance, a convenient source of data is the extensive records kept for road races of various lengths. The best known of these are perhaps the Boston and New York City marathons, but marathons and shorter races are held each year in many locations. A given race can involve thousands to tens of thousands of participants, and some races maintain lists of results that include the name, sex, and age of each participant as well as their time in the race. Some examples are the Cherry Blossom race held each year in Washington, DC (http://www.cherryblossom.org/results/resultsindex.htm), which has both a ten-mile race and a 5K run-walk, and the Los Angeles Marathon (http://results.active.com/pages/page.jsp?eventLinkageID=6277), which also sponsors a 5K race.

Scraping such data from the web allows one to compile a dataset with running times for tens of thousands of participants. A simple analysis is to model running time by age (with sex being an important covariate). Such an analysis shows clearly that older runners typically are slower than younger runners. However, that is not the same thing as saying that runners slow as they age. As a cohort of runners ages, there is a strong possibility that the less fit runners preferentially drop out, creating a sampling bias when trying to assess the effects of aging.
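
As a rough illustration of the cross-sectional approach, the sketch below fits a separate straight line of running time against age for each sex by ordinary least squares. The records are invented for the example and are not taken from any race; the linear form and the names `results`, `fit_line`, and `cross_sectional` are assumptions made here, not part of the original analysis.

```python
from collections import defaultdict

# Invented records for illustration only: (age in years, sex, time in minutes).
results = [
    (25, "F", 88.0), (35, "F", 92.0), (45, "F", 96.0), (55, "F", 100.0),
    (25, "M", 78.0), (35, "M", 81.0), (45, "M", 84.0), (55, "M", 87.0),
]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Split the records by sex so that each group gets its own fitted line.
by_sex = defaultdict(lambda: ([], []))
for age, sex, minutes in results:
    by_sex[sex][0].append(age)
    by_sex[sex][1].append(minutes)

cross_sectional = {
    sex: fit_line(ages, times) for sex, (ages, times) in by_sex.items()
}
for sex, (intercept, slope) in sorted(cross_sectional.items()):
    print(f"{sex}: {slope:.2f} minutes slower per year of age")
```

The slope here compares different people of different ages in a single year, so it is subject to the dropout bias described above and should not be read as the rate at which any individual slows.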

When running data is available for a span of years, it becomes possible to carry out a longitudinal analysis, tracking each runner as he or she ages. Doing this involves identifying individual runners as they appear in successive years of data. The name, age, and sex information make this possible. A preliminary longitudinal analysis of the Cherry Blossom 10-mile race data indicates that runners slow much more rapidly with age than a cross-sectional analysis suggests, and that the rate of slowing itself increases as runners age.
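
One possible way to link runners across years is sketched below. It assumes, as a working rule rather than the method actually used, that two results belong to the same person when the normalized name and the sex agree and the implied birth year (race year minus age) differs by at most one, since the reported age depends on whether the birthday falls before race day. The records and the names `link_runners` and `yearly_changes` are hypothetical.

```python
from collections import defaultdict

# Invented records: (race year, name, sex, age on race day, net time in minutes).
records = [
    (2005, "Pat Example", "F", 40, 90.0),
    (2006, "pat example ", "F", 41, 91.5),
    (2008, "Pat Example", "F", 43, 95.0),
    (2005, "Sam Sample", "M", 30, 75.0),
    (2006, "Sam Sample", "M", 52, 99.0),  # same name, implausible age jump
]

def link_runners(rows, tolerance=1):
    """Group rows that plausibly belong to one person.

    Rows match when the normalized name and sex agree and the implied
    birth year differs by at most `tolerance` from the first row seen
    for that candidate.
    """
    runners = defaultdict(list)  # (name, sex) -> list of [birth_year, rows]
    for row in sorted(rows):  # sorting puts each history in year order
        year, name, sex, age, _ = row
        key = (" ".join(name.lower().split()), sex)
        birth_year = year - age
        for candidate in runners[key]:
            if abs(candidate[0] - birth_year) <= tolerance:
                candidate[1].append(row)
                break
        else:
            runners[key].append([birth_year, [row]])
    return [c[1] for group in runners.values() for c in group]

def yearly_changes(history):
    """Minutes gained per year between consecutive appearances."""
    return [
        (later[4] - earlier[4]) / (later[0] - earlier[0])
        for earlier, later in zip(history, history[1:])
    ]

linked = link_runners(records)
for history in linked:
    if len(history) > 1:
        print(history[0][1], yearly_changes(history))
```

A rule this simple will merge distinct people who share a name, sex, and birth year, and will split one person whose name is recorded differently from year to year, so real data would need a check on how often each kind of error occurs.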

The short races (e.g., 5K, run-walk) presumably attract less fit (or less competitive) participants than the long races (e.g., marathons). It might be instructive to study whether the age-related speed changes differ between the participants in the long and short races. Similarly, one could examine whether age-related slowing is different for runners whose performance lies at different percentiles of their age group. Another analysis could examine how the probability that a runner will participate in future races depends on his or her age and performance in past races. Beyond the important covariate of sex, potential covariates include the weather and course conditions, and the distance of the runner from the starting line. (In the Cherry Blossom data, which includes both the "gun" time and the "net" time, there is evidence that repeat runners move closer to the start line from year to year, biasing the results from data based on the gun time.)
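
The gun-time bias can be made concrete with a small sketch. It assumes that the gun time minus the net time measures how long a runner took to reach the start line, and treats that delay as a stand-in for distance from the line; the figures are invented, and `appearances` and `start_delay` are hypothetical names.

```python
# Invented records for one repeat runner:
# (race year, gun time in minutes, net time in minutes).
appearances = [
    (2004, 101.0, 96.0),
    (2005, 99.5, 96.5),
    (2006, 98.0, 97.0),
]

def start_delay(gun_minutes, net_minutes):
    """Minutes between the starting gun and the runner crossing the start line."""
    return gun_minutes - net_minutes

delays = [start_delay(gun, net) for _, gun, net in appearances]

# Change from the first to the last appearance under each timing method.
gun_change = appearances[-1][1] - appearances[0][1]
net_change = appearances[-1][2] - appearances[0][2]

print("start delays:", delays)
print("change in gun time:", gun_change)
print("change in net time:", net_change)
```

In this made-up case the runner's start delay shrinks each year, so the gun time falls even though the net time rises: an analysis based on gun time would report improvement where the net time shows slowing.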