How the TDT came to pass
Ott (1989) and Terwilliger & Ott (1992): saw merit in disaggregating the two genotypes containing a given marker allele, and reformatting the data along the following lines:
A contribution of 1 to a cell in this table corresponds to the fate of one parent's alleles, in relation to one affected child.
They called their approach haplotype-based (H)HRR, and dubbed Falk & Rubinstein's genotype-based (G)HRR. They also gave a variety of tests and reformulations of the problem whose details need not concern us. However, they were clear that at all times they were concerned with the null hypothesis of no population gametic association between and .
Spielman et al (1993): In this paper the TDT was invented. It came from the recognition that Ott & Terwilliger's reformulation of Falk & Rubinstein's HRR, namely, writing the data in the form
permits a test of
rather than (or as well as) a test of the null hypothesis of no gametic association. The TDT is based upon the observation that under the null hypothesis of no linkage, and the fact that all the information concerning is in b and c. In general
where A depends on penetrances and joint haplotype frequencies, and A = 0 when there is no gametic association (suitably defined).
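As a concrete illustration of a test that uses only b and c, here is a minimal sketch of the familiar McNemar-type form of the TDT statistic, (b - c)^2 / (b + c), referred to a chi-square distribution with one degree of freedom. It assumes that b and c are the two off-diagonal counts of the table (heterozygous parents who transmit one marker allele and not the other). The function name and the counts are hypothetical.

```python
from math import erfc, sqrt


def tdt_chi_square(b, c):
    """Return (statistic, p_value) for the McNemar-type TDT statistic.

    b, c: counts of heterozygous parents transmitting one marker allele
    and not the other (assumed to be the off-diagonal cells of the table).
    """
    if b + c == 0:
        raise ValueError("no informative transmissions (b + c == 0)")
    stat = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts, for illustration only.
stat, p_value = tdt_chi_square(60, 40)
print(f"TDT statistic = {stat:.2f}, p-value = {p_value:.4f}")
```

The diagonal cells of the table do not enter the calculation, which is the sense in which all the information is in b and c.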
A different approach to the TDT
In the usual analysis of TDT, the population assumptions of random mating and Hardy-Weinberg equilibrium for haplotypes are made. In this context the relevant notion of gametic association is linkage disequilibrium. (An exception to these sweeping population assumptions is the analysis by Ewens and Spielman (1995), which specifically considered mixed populations.) We shall take a different path and define a notion of gametic association that allows derivation of the TDT without the population assumptions of random mating and Hardy-Weinberg equilibrium.
To define this notion of gametic association we first set up some notation. Let be a marker locus with alleles and be a disease susceptibility locus with alleles as above. Then we write the population mating-type frequencies as follows
where the genotype on the left, , refers to the maternal genotype, and the genotype on the right, , refers to the paternal genotype. (Writing genotypes in this way is not meant to imply that the M and D loci are linked. It indicates that, for example, the mother received her and alleles from the same grandparent.) The notation is invariant under switches between and , and between and . That is
The relevant notion of no association (for me) is the following.
No mating-type gametic association: is defined to be symmetry of under switches between i and j, and between k and l. That is, for all index combinations,
Notice that this is equivalent to symmetry under switches between s and t, and between u and v.
It is not hard to show that under the assumptions of Hardy-Weinberg equilibrium, random mating and linkage equilibrium between the DS locus and the marker, the above symmetry holds. However, if that is all that is needed to make our proof work, why assume more?
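The claim can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the derivation: it assumes the mother's phased genotype is (M_i D_s / M_j D_t), the father's is (M_k D_u / M_l D_v), and that under Hardy-Weinberg equilibrium and random mating the mating-type frequency is the product of the four haplotype frequencies. The function names and the haplotype frequencies are hypothetical.

```python
from itertools import product


def mating_type_freq(hap, i, s, j, t, k, u, l, v):
    """Frequency of the mating type (M_i D_s / M_j D_t) x (M_k D_u / M_l D_v).

    hap[m][d] is the frequency of haplotype M_m D_d. The product form is an
    assumption standing in for Hardy-Weinberg equilibrium and random mating.
    """
    return hap[i][s] * hap[j][t] * hap[k][u] * hap[l][v]


def has_no_mating_type_gametic_association(hap, tol=1e-12):
    """Check symmetry under i <-> j (mother) and under k <-> l (father)."""
    n_marker, n_disease = len(hap), len(hap[0])
    for i, j, k, l in product(range(n_marker), repeat=4):
        for s, t, u, v in product(range(n_disease), repeat=4):
            base = mating_type_freq(hap, i, s, j, t, k, u, l, v)
            swap_mother = mating_type_freq(hap, j, s, i, t, k, u, l, v)
            swap_father = mating_type_freq(hap, i, s, j, t, l, u, k, v)
            if abs(base - swap_mother) > tol or abs(base - swap_father) > tol:
                return False
    return True


# Linkage equilibrium: haplotype frequency = product of allele frequencies.
p = [0.3, 0.7]   # hypothetical marker allele frequencies
q = [0.9, 0.1]   # hypothetical disease-locus allele frequencies
hap_le = [[p_m * q_d for q_d in q] for p_m in p]

# Same allele frequencies, but with linkage disequilibrium.
hap_ld = [[0.22, 0.08], [0.68, 0.02]]

print(has_no_mating_type_gametic_association(hap_le))  # symmetry holds
print(has_no_mating_type_gametic_association(hap_ld))  # symmetry fails
```

The product form is used here only to generate examples. The symmetry condition itself is stated directly in terms of the mating-type frequencies and does not require it.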
What we wish to analyse with the TDT are data consisting of the marker alleles transmitted and not transmitted by parents to affected children. These data can be put in a table as follows (where the allele pairs consist of a maternal allele on the left and a paternal allele on the right):
Take note of the subscripts in the cell counts . The first two subscripts refer to the maternal alleles transmitted and not-transmitted (in that order), and the second two subscripts refer to the paternal alleles in a similar way. In most existing discussions of this material, only one parent is considered. Run the argument for this case in parallel with what follows.
In deriving the TDT we shall first of all consider a probability associated with each cell in the table. We define to be the probability of the mother transmitting and not transmitting , and the father transmitting and not transmitting , conditional on the parents' genotypes at the marker locus and affectedness of the child. In order to write the necessary probability expressions more compactly, we shall write as short-hand for "transmits marker allele ", and as short-hand for "does not transmit marker allele ". Using this notation we write
If we are considering the parents separately, then we use the notation
The null hypothesis of no association between parental alleles transmitted and affectedness of the child can be defined very naturally using the (or the or the ) :
It turns out that this null hypothesis is the disjunction of two related nulls, that of no mating-type gametic association and that of no linkage between and . Either of these nulls implies that our are symmetric in ij and kl.
To show this we prove the following in the case of a single disease susceptibility locus (where and are the recombination fractions between the marker locus and the disease susceptibility locus for mothers and fathers respectively). Fix i,j,k and l.
Proposition. Assume either no mating-type gametic association or .
Proof: First we define some terms , related to the terms, as follows:
What we will eventually prove is that either no mating-type gametic association or implies that . This gives us the result we want because it is easy to show that
Exercise 4: Check this result.
Now we derive an expression for which allows for comparison between , etc. This will involve expanding the expression to include the disease locus, and then expanding further with conditional probabilities. Then we shall use two assumptions to simplify the expansion into the form we want. These assumptions are frequently used in the literature without being spelt out. Below is the definition of followed by an expansion which includes all the possible joint genotypes at the disease and marker loci.
In the next step we expand this expression to include the different transmission possibilities involving the alleles at the disease locus. We also expand the notation so that means "transmitted allele i at the marker locus, and transmitted allele s at the disease locus". The redundant "not transmitted" terms are omitted.
In the next step we split up each of the four probabilities above into products of conditional probabilities. We'll just show it for the first!
We examine each of the three probabilities in the derived expression. The first probability is the mating-type frequency . The second probability is a simple expression involving the recombination fractions and . But to write it we need to use the following assumption that we've been implicitly using all along.
Assumption 1: No segregation distortion (cf. assumption G2 in week 6 notes).
This means that parents are equally likely to transmit either of their two alleles at a locus. This is extended to multilocus transmission by conditioning on presence or absence of recombination in each parent.
Then we write
assuming independence of recombination events between mothers and fathers, and no segregation distortion, respectively.
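To make the two-locus transmission probabilities concrete, here is a minimal sketch of the standard calculation: with no segregation distortion, a parent transmits each non-recombinant haplotype with probability (1 - theta)/2 and each recombinant haplotype with probability theta/2, where theta is that parent's recombination fraction. Independence between parents then gives the joint probability as a product. The function names, the allele labels and the parameter names `theta_mother` and `theta_father` are hypothetical.

```python
def gamete_prob(parent, gamete, theta):
    """P(parent transmits `gamete`) for a phased two-locus genotype.

    parent: ((m1, d1), (m2, d2)), the two phased haplotypes (marker, disease).
    gamete: (m, d), the transmitted marker and disease alleles.
    theta:  recombination fraction between the two loci in this parent.
    Assumes no segregation distortion.
    """
    (m1, d1), (m2, d2) = parent
    candidates = [
        ((m1, d1), (1 - theta) / 2),  # non-recombinant
        ((m2, d2), (1 - theta) / 2),  # non-recombinant
        ((m1, d2), theta / 2),        # recombinant
        ((m2, d1), theta / 2),        # recombinant
    ]
    # Summing handles homozygous loci, where several candidates coincide.
    return sum(prob for g, prob in candidates if g == gamete)


def joint_transmission_prob(mother, father, g_mother, g_father,
                            theta_mother, theta_father):
    """Joint probability, assuming recombination is independent across parents."""
    return (gamete_prob(mother, g_mother, theta_mother)
            * gamete_prob(father, g_father, theta_father))


mother = (("M1", "D1"), ("M2", "D2"))
father = (("M1", "D2"), ("M2", "D2"))
print(gamete_prob(mother, ("M1", "D1"), 0.1))  # non-recombinant gamete
print(gamete_prob(mother, ("M1", "D2"), 0.1))  # recombinant gamete
print(joint_transmission_prob(mother, father, ("M1", "D1"), ("M2", "D2"), 0.1, 0.2))
```

When theta = 1/2 all four gametes of a double heterozygote are equally likely, which is the no-linkage case.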
To simplify the third probability we use the following
Assumption 2: The child's phenotype is determined solely by his/her genotype at the DS locus.
Here are the parents' genotypes at the locus/loci in the bracket, is the parents' transmission at the locus/loci in the bracket and is the child's genotype at the locus/loci in the bracket. (cf. assumption G3 in week 6 notes).
This enables us to write the third probability as (since we are assuming only one disease locus , so that what is transmitted at the marker locus to the child can be ignored too). Now we put all this together to get
Similarly we can derive
and so on for the other two probabilities in the expansion. Putting these together gives
To show that this gives the result we want, consider the similar expression for . It is
Under no mating-type gametic association we can swap i and j in the expression on the right, and this gives the expression for . So . A similar argument establishes equality also with and . Under no linkage, i.e. , the expression for becomes
This concludes the proof of the proposition.
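The proposition can also be checked numerically by brute-force enumeration. The sketch below is an illustration under assumptions that go beyond the text: a disease locus with two alleles coded 0 and 1, penetrance depending only on the number of copies of allele 1 in the child, and mating-type frequencies generated as products of haplotype frequencies. All names and numerical values are hypothetical.

```python
from itertools import product
from math import isclose


def joint_prob(hap, pen, theta_m, theta_f, i, j, k, l):
    """P(mother transmits M_i not M_j, father transmits M_k not M_l, child affected).

    hap[m][d]: haplotype frequencies (product form for mating types is assumed).
    pen[c]:    penetrance for a child carrying c copies of disease allele 1.
    theta_m, theta_f: recombination fractions in mothers and fathers.
    """
    total = 0.0
    for s, t, u, v in product(range(2), repeat=4):
        # Mother is (M_i D_s / M_j D_t), father is (M_k D_u / M_l D_v).
        freq = hap[i][s] * hap[j][t] * hap[k][u] * hap[l][v]
        # Disease allele travelling with the transmitted marker allele:
        # the phase partner if non-recombinant, the other allele if recombinant.
        from_mother = [(s, (1 - theta_m) / 2), (t, theta_m / 2)]
        from_father = [(u, (1 - theta_f) / 2), (v, theta_f / 2)]
        for (x, p_x), (y, p_y) in product(from_mother, from_father):
            total += freq * p_x * p_y * pen[x + y]
    return total


def four_patterns(hap, pen, theta_m, theta_f, i=0, j=1, k=0, l=1):
    """The four transmission patterns for the marker mating type ij x kl."""
    return [joint_prob(hap, pen, theta_m, theta_f, *idx)
            for idx in [(i, j, k, l), (j, i, k, l), (i, j, l, k), (j, i, l, k)]]


def all_equal(values):
    return all(isclose(v, values[0], rel_tol=1e-9) for v in values)


pen = [0.01, 0.2, 0.8]                      # hypothetical penetrances
hap_le = [[0.27, 0.03], [0.63, 0.07]]       # linkage equilibrium
hap_ld = [[0.22, 0.08], [0.68, 0.02]]       # linkage disequilibrium

print(all_equal(four_patterns(hap_ld, pen, 0.5, 0.5)))  # no linkage
print(all_equal(four_patterns(hap_le, pen, 0.0, 0.0)))  # no association
print(all_equal(four_patterns(hap_ld, pen, 0.0, 0.0)))  # linkage and association
```

Only in the third case, where both nulls fail, do the four patterns receive different probabilities.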
Testing the hypothesis.
The likelihood expression.
In order to test hypotheses concerning the , we first need to obtain a likelihood for the data. We'd like to believe that the results for each set of two parents and an affected child are mutually independent, when appropriately conditioned. Then we can write the likelihood expression as a product of terms. If we have data from family in the form
where is phenotype information (in this case, that the child is affected), then we'd like to be able to split up the likelihood function into terms
This could be achieved directly by making the following sampling assumption
Assumption: Conditional independence of marker transmission between families, given marker genotypes and disease phenotypes.
where OTH means all the data for all of the OTHer families in the data set. This assumption means that data from the other families should not tell us anything about likely segregation at the marker locus if we know the parents' genotypes at the marker locus and that the child is affected. This rules out using related individuals from a pedigree.
However, I don't assume this directly (how would we know?), but rather go back to the way in which the probabilities were calculated (via ) to see where we meet the need for assumptions. What we do is write the likelihood function
Next we analyse the probability
which is conditioned on the rest of the data, and see what assumptions we need to make. In the first step we simply write out the conditional probability (leaving out the redundant terms):
The numerator is similar to . Indeed we can manipulate it in virtually the same way as for to get the analogue of the earlier expansion, which is
Each of the four terms in this expansion can be written as a product of conditional probabilities in a similar way to , but with an extra term in the product to account for . We show this for the first term (cf. the corresponding expansion earlier). Writing the expression out in this way enables us to see the assumptions we need to make in order to get the desired result. The expression is
To get the forms we want for the last three terms in this expression we make the following sampling assumptions (i.e. assumptions referring to the extent to which independence between families is necessary).
Assumption A. Independence of parental genotypes between families
This assumption can be thought of as ruling out related families from the sample. (cf. S2 in the week 6 notes).
Assumption B. The child's phenotype is (still) determined solely by his/her genotype at the DS locus.
This is the sampling version of Assumption 2. (cf. S1 in week 6 notes).
Using these assumptions, and then the same argumentation and assumptions (1 & 2) as for gives
Finally we get
Now we go back to the conditional probability written out earlier. The numerator has just been dealt with, and we note that the denominator can be written as the sum of four probabilities expressing the four different transmission possibilities (same as for in terms of the ). Writing it like this, the terms cancel, and what is left is . This gives us the result that under assumptions 1, 2, A and B we can write the likelihood function as a product of the .
Possible statistical tests.
To simplify the story, let us suppose that we only have data on the mothers (say) of affected children. In this case simplifies to
where and .
Then the terms simplify to
The likelihood function (of r) generated by data on mothers with genotype at the marker locus is
where n is the number of mothers with marker genotype , and is the number of these mothers who transmitted to their affected child, and so .
There are various ways in which we might go about testing the null hypothesis of , e.g. a likelihood ratio test (after specifying the alternative hypothesis), a score test, a Wald test, etc. Let us try a likelihood ratio test to compare the null hypothesis of against the alternative r=0 (for example). Then we would compute
and compare the observed value to the null distribution. However, we don't know and . What turns out to be much better is to use the score test. The score statistic is
and it is computable. This is the TDT. In essence, it tests whether there are an unusually large or small number of transmissions of i under the null hypothesis: in effect, a binomial test. Ideally, we would prefer to combine all such likelihoods over i and j and do a single score test in r. Unfortunately, our luck fails here, for there is no computable score test for all transmission data: the unknown quantities play an important role as weights. There are a number of sensible ways to proceed, but we must stop here. Read the American Journal of Human Genetics over the last couple of years for some of this research, and/or try Ex. 5.
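The binomial view of the test can be sketched as follows. Under the null hypothesis, each of the n mothers with a given heterozygous marker genotype transmits allele i with probability 1/2, so the number of transmissions of i is Binomial(n, 1/2). The code computes an exact two-sided p-value, together with the chi-square form (2 n_i - n)^2 / n, which is assumed here to be the form taken by the score statistic. The function names and counts are hypothetical.

```python
from math import comb


def binomial_two_sided_p(n_i, n):
    """Exact two-sided p-value for n_i transmissions of allele i out of n.

    Null hypothesis: each heterozygous mother transmits allele i with
    probability 1/2. The p-value sums the probabilities of all outcomes
    that are no more likely than the observed one.
    """
    probs = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
    observed = probs[n_i]
    # Small relative slack guards against floating-point ties.
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-12)))


def tdt_score_statistic(n_i, n):
    """Chi-square form (n_i - n_j)^2 / (n_i + n_j), with n_j = n - n_i."""
    if n == 0:
        raise ValueError("no heterozygous mothers (n == 0)")
    return (2 * n_i - n) ** 2 / n


# Hypothetical data: 10 heterozygous mothers, 9 of whom transmitted allele i.
print(binomial_two_sided_p(9, 10))
print(tdt_score_statistic(9, 10))
```

For small n the exact p-value is preferable to the chi-square approximation. For large n the two agree closely.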
Exercise 5: A kind of "main-effects only" model for the ratios is , where the are quantities expressing the extent to which allele i is preferentially transmitted. With this model for the , describe the MLEs of the and an overall likelihood ratio test of the null vs the alternative r=0.
Exercise 6: Discuss the extent to which we can include both parents' transmission data in a single analysis like the preceding one. Similarly, can we include transmission data for more than one affected child in the same family in an analysis like the above one?
G.H. Hardy, Mendelian proportions in a mixed population, Science, vol. 28, 1908, pp. 49-50.
Mourant et al, Blood groups and disease: a study of associations of diseases with blood groups and other polymorphisms, New York: Oxford University Press, 1978.
Warren J. Ewens & Richard S. Spielman, The Transmission/Disequilibrium Test: History, Subdivision, and Admixture, American Journal of Human Genetics, 57:455-465, 1995.