Model-free linkage analysis

Dave Curtis and Pak Sham, July 1995.

Curtis D and Sham PC. Model-free linkage analysis using likelihoods. Am J Hum Genet, 1995.

The classical lod score method of linkage analysis necessitates that the transmission model for a trait be specified. If the model is misspecified then artefactually negative lod scores may be produced, especially close to marker loci, and between flanking markers. Nonparametric methods deliberately discard available information concerning the pattern of segregation of disease and markers through pedigrees, and doing this inevitably loses power. Attempts have been made to make lod score analysis less model-dependent by performing multiple analyses, by treating unaffected subjects as of unknown affection status and by ignoring multipoint data, but again these techniques inevitably reduce power.

Lod score methods are least reliable at small recombination fractions because phenocopies and non-penetrant carriers not allowed for in the transmission model tend to be counted as recombinants. The method described avoids having to rely on the confounding of recombination and transmission model parameters, and seeks to test for a genetic effect at a particular test position without making prior assumptions concerning the mode of transmission.

Estimating parameters using likelihoods

A number of methods of genetic analysis utilise maximum likelihood methods to estimate parameters and to provide support for competing hypotheses. In order to calculate the likelihood for a given dataset, values for all parameters of the genetic model must be provided.

Parameters for a diallelic susceptibility locus

If we consider a two-allele locus which may influence susceptibility to disease, we need a number of parameters to describe its effect. Arbitrarily taking one of the alleles as the susceptibility allele, we denote its frequency as q and define the probability of affection conditional on having 0, 1 or 2 copies of this allele as f0, f1 and f2. For convenience we can treat these three penetrance values as a vector having three elements, denoted F. The penetrance vector, F, and the allele frequency, q, define the transmission model, termed T. We can denote the transmission model for a locus having no effect on susceptibility as T0, where f0=f1=f2 (or perhaps q=0 or q=1, implying the locus is monomorphic).

This susceptibility locus may be at a certain genetic distance relative to marker loci. Where only one marker is used, this distance is measured by the recombination fraction, theta, but for multipoint analyses the distance will be a function of two or more recombination fractions. Locus heterogeneity means that there is a second locus which influences susceptibility to the disease, and this locus is unlinked to the marker locus or loci and is usually assumed to have the same mode of transmission as the linked locus. The proportion of families segregating the disease due to the linked locus is termed alpha.

Segregation analysis

In segregation analysis the likelihood is maximised over different values of the transmission model parameters. Although T is represented diagrammatically as a unidimensional parameter, in fact it comprises the three penetrance parameters and the allele frequency. If marker genotypes are available then they are not regarded as relevant for segregation analysis, and to represent this in the model the recombination fraction between the disease and marker loci can be fixed to 50%.

A likelihood-based test for the existence of a genetic locus affecting susceptibility to disease may be based on the following likelihood-ratio:

LR = L(D | T) / L(D | T0) 
   = L(D | f0,f1,f2,q) / L(D | f0=f1=f2 or q=0 or q=1)

Here D represents the observed pattern of disease in the pedigree data set, and the parameters of the transmission model T are chosen to maximise the likelihood for these data. In practice there are problems with applying this approach because it is difficult to allow for the effects of selection bias and for other possible causes of familial aggregation.

If we wished to write a fuller formulation which could also take account of marker data, denoted M, then we could do this by fixing the recombination fraction to 50%:

LR = L(D,M | f0,f1,f2,q,theta=0.5) / L(D,M | f0=f1=f2 or q=0 or q=1,theta=0.5)

Classical linkage analysis

In classical linkage analysis the transmission model is fixed (possibly with parameter values obtained from segregation analysis) and the likelihoods of the disease and marker data are compared under the null hypothesis of no linkage and the alternative hypothesis of linkage:

LR = L(D,M | theta<0.5,{q,F}) / L(D,M | theta=0.5,{q,F}) 

Here, the notation {q,F} indicates that the susceptibility allele frequency and penetrance values are fixed in advance and are the same in the numerator and denominator. The log base 10 of the likelihood ratio, log(LR), is the standard lod score.
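
As a small illustration of the definition above, the following sketch computes a lod score from two likelihood values. The function name and the numerical likelihoods are hypothetical and chosen only for illustration; they do not come from any real data set.

```python
import math


def lod_score(lik_linked: float, lik_unlinked: float) -> float:
    """Return log10 of L(D,M | theta<0.5) / L(D,M | theta=0.5).

    Both likelihoods are assumed to have been computed under the same
    fixed transmission model {q,F}.
    """
    if lik_linked <= 0.0 or lik_unlinked <= 0.0:
        raise ValueError("likelihoods must be positive")
    # Subtracting logs avoids forming a very large or very small ratio.
    return math.log10(lik_linked) - math.log10(lik_unlinked)


# Hypothetical values: the data are 1000 times more probable under linkage.
print(lod_score(2.0e-12, 2.0e-15))
```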

Linkage analysis incorporating heterogeneity

One can easily extend classical linkage analysis to test the hypothesis that a susceptibility locus for the disease may be linked to the marker, but may not necessarily be operative in all families. The proportion of families which are linked is denoted alpha, and the null hypothesis can then be represented either by theta=0.5 or alpha=0.

LR = L(D,M | theta<0.5,alpha>0,{q,F}) / L(D,M | theta=0.5 or alpha=0,{q,F}) 

As a test for linkage in the presence of heterogeneity one may use the lod2 statistic, log(LR). This test has two free parameters, but the exact statistical interpretation of this is complicated by the fact that theta and alpha are confounded under the null hypothesis.

Testing for an effect on susceptibility at a given position

In order to produce the desired model-free test for the presence of a susceptibility locus at a given position, denoted theta=t, two modifications are made. Firstly, theta is fixed in both the denominator and numerator so that the only free parameter with respect to linkage is alpha, the proportion of families linked. Secondly, the parameters of the transmission model are freed in both the numerator and the denominator. The likelihood is maximised over transmission model parameters independently under the hypotheses of both linkage and non-linkage:

LR = L(D,M | alpha>0,theta=t,q,F) / L(D,M | alpha=0,theta=t,q,F) 

Because there is one more free parameter in the numerator than in the denominator, log(LR) provides a test with one degree of freedom and is comparable with a standard lod score.

Constraining transmission model parameters

Although the test as described above is valid, in some circumstances it will have very little power. If a sample of affected sibs is used then the maximum-likelihood values for the transmission model parameters under both linkage and nonlinkage will be f2=1, q=1 or f0=1,q=0 or f0=f1=f2=1. If the sample consists entirely of affected subjects then the transmission model parameters will be chosen to reflect this. If values such as these are arrived at under the hypothesis of nonlinkage then it is impossible for parameters relating to linkage to affect the likelihood of the data and the sample would be completely uninformative for linkage. The problem may be less extreme for other sample types, but in general pedigrees selected for linkage analysis will be multiply-affected and the evidence in favour of linkage may be masked by transmission models in which the risk of the disease is overestimated.

Constraining to produce correct prevalence

Although selection bias will not produce artefactual evidence for linkage, it may yield unrealistic estimates of the transmission model parameters which will prevent the detection of linkage. In an attempt to counteract this, we can impose the constraint on these parameters that they must yield the correct population prevalence for the disease, K. We can denote this constraint as [q,F], and rewrite the test as follows:

LR = L(D,M | alpha>0,theta=t,[q,F]) / L(D,M | alpha=0,theta=t,[q,F]) 

Again, q and F may take different values under the hypotheses of linkage and nonlinkage to maximise the numerator and denominator independently. If we impose the additional constraint f0<=f1<=f2 (so that the heterozygote penetrance lies between that of the homozygotes), then we can draw a polyhedron which encloses all possible values for F:

This polyhedron is defined by the constraints 0<=f0,f1,f2<=1, f0<=K, f0<=f1<=f2 and f2>=K. It has a vertex at the point (K,K,K), corresponding to T0, which models the locus having no effect on susceptibility. At all other points there is a single value for q which produces the correct value of K, so that q becomes a function of F.
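
The dependence of q on F can be sketched numerically. The example below assumes Hardy-Weinberg genotype proportions, which the text does not state explicitly, so that the prevalence is (1-q)^2.f0 + 2q(1-q).f1 + q^2.f2. The function names are hypothetical, and bisection is used simply because the prevalence is non-decreasing in q when f0<=f1<=f2.

```python
def prevalence(q, f0, f1, f2):
    """Population prevalence, assuming Hardy-Weinberg genotype proportions."""
    p = 1.0 - q
    return p * p * f0 + 2.0 * p * q * f1 + q * q * f2


def allele_frequency(f0, f1, f2, K, tol=1e-12):
    """Find the q in [0, 1] for which the penetrances F give prevalence K."""
    if not (0.0 <= f0 <= f1 <= f2 <= 1.0 and f0 <= K <= f2):
        raise ValueError("F lies outside the allowed polyhedron for this K")
    if f0 == f2:
        # The null-effect vertex (K,K,K): every q gives prevalence K.
        raise ValueError("q is not determined when f0=f1=f2")
    lo, hi = 0.0, 1.0
    # Bisection: prevalence(q) rises from f0 at q=0 to f2 at q=1.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if prevalence(mid, f0, f1, f2) < K:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Fully penetrant recessive model with an illustrative prevalence of 1%.
q = allele_frequency(0.0, 0.0, 1.0, K=0.01)
print(q, prevalence(q, 0.0, 0.0, 1.0))
```

For the fully penetrant recessive model the prevalence is simply q^2, so with K=0.01 the search should settle close to q=0.1.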

Further constraints on model parameters

In order to make the procedure less computationally demanding one can avoid having to perform multidimensional maximisation of the likelihood over f0, f1 and f2 simultaneously by restricting consideration to a smaller subset of models: those represented in the figure by the dotted lines joining the Mendelian recessive model, at (0,0,1), through the null effect model, at (K,K,K), to the Mendelian dominant model at (0,1,1). It can be seen that these lines pass close to most points within the allowable volume, and so testing only these models may be expected to be adequate for the purposes of carrying out the proposed linkage analysis. Additive models are not tested explicitly, but in the context of linkage analysis an additive model may behave in a very similar way to one of the dominant or recessive models which is considered. If only the models indicated are used, then f0 and f2 both become functions of f1 and the transmission model is completely defined by the choice of f1.
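
One way to make this concrete is sketched below. It assumes that the two dotted lines are straight segments in (f0,f1,f2) space, which cannot be confirmed here because the figure is not reproduced; the function name is hypothetical.

```python
def penetrances_from_f1(f1, K):
    """Return (f0, f1, f2) on the assumed path recessive -> null -> dominant.

    For f1 <= K the point lies on the straight segment from (0,0,1) to
    (K,K,K); for f1 > K it lies on the segment from (K,K,K) to (0,1,1).
    """
    if not 0.0 < K < 1.0:
        raise ValueError("K must lie strictly between 0 and 1")
    if not 0.0 <= f1 <= 1.0:
        raise ValueError("f1 must lie between 0 and 1")
    if f1 <= K:
        # Recessive-like models: heterozygotes are at the same risk as
        # non-carriers, and f2 falls from 1 to K as f1 rises from 0 to K.
        f0 = f1
        f2 = 1.0 - (1.0 - K) * f1 / K
    else:
        # Dominant-like models: heterozygotes are at the same risk as
        # homozygous carriers, and f0 falls from K to 0 as f1 rises to 1.
        f0 = K * (1.0 - f1) / (1.0 - K)
        f2 = f1
    return f0, f1, f2


# Illustrative prevalence of 1%: the two end points and the null model.
for f1 in (0.0, 0.01, 1.0):
    print(f1, penetrances_from_f1(f1, K=0.01))
```

Under this parameterisation every point satisfies f0<=K<=f2 and f0<=f1<=f2, so it stays inside the polyhedron of allowed penetrance vectors.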

Practical implementation

This analysis can be implemented quite simply by setting up the appropriate data files and then calling MLINK to perform the necessary likelihood calculations. A value for K is supplied and a few values of f1 are chosen - say 6 ranging from 0 to K, and a further 5 increasing to 1. For each transmission model MLINK is called to calculate the likelihood for each pedigree assuming the locus is at the test position (Ll) and assuming it is unlinked (Lu). These values are used to calculate the likelihood for the whole dataset assuming that a proportion alpha of families are linked, the likelihood for each family being alpha.Ll + (1-alpha).Lu. Under the hypothesis of linkage, the likelihood is maximised over alpha and f1, whereas under the hypothesis of non-linkage the likelihood is maximised only over f1. The logarithm of the ratio of these likelihoods provides a model-free lod score to test the hypothesis that there is a susceptibility locus at the position tested.
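
The maximisation step can be sketched as follows. This is not the MFLINK source: the per-pedigree likelihoods are taken as plain lists of numbers, however they were obtained, and the function names, the simple grid search over alpha and the example values are all assumptions made for illustration.

```python
import math


def _log10(x):
    """log10 that returns -inf for a zero likelihood instead of failing."""
    return math.log10(x) if x > 0.0 else -math.inf


def dataset_log10_lik(alpha, linked, unlinked):
    """log10 likelihood of all pedigrees when a proportion alpha is linked.

    linked and unlinked hold the per-pedigree likelihoods Ll and Lu.
    """
    return sum(_log10(alpha * ll + (1.0 - alpha) * lu)
               for ll, lu in zip(linked, unlinked))


def model_free_lod(models, alpha_steps=100):
    """Return (lod, best_alpha, best_f1).

    models is a list of (f1, linked, unlinked) tuples, one per
    transmission model tried.
    """
    # Non-linkage: alpha is fixed at 0 and only f1 is varied.
    null_best = max(dataset_log10_lik(0.0, ll, lu) for _, ll, lu in models)
    # Linkage: search over every f1 and a grid of alpha values in [0, 1].
    best_value, best_alpha, best_f1 = -math.inf, 0.0, None
    for f1, ll, lu in models:
        for i in range(alpha_steps + 1):
            alpha = i / alpha_steps
            value = dataset_log10_lik(alpha, ll, lu)
            if value > best_value:
                best_value, best_alpha, best_f1 = value, alpha, f1
    return best_value - null_best, best_alpha, best_f1


# Hypothetical likelihoods for three pedigrees under two values of f1.
models = [
    (0.005, [2e-6, 1e-6, 4e-6], [1e-6, 2e-6, 1e-6]),
    (0.5, [3e-6, 2e-6, 5e-6], [2e-6, 2e-6, 2e-6]),
]
print(model_free_lod(models))
```

Because alpha=0 is included in the grid, the score returned by this sketch can never be negative. A finer grid or a one-dimensional optimiser could replace the grid search over alpha.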

A program to carry out this simple procedure, called MFLINK, will shortly be available from John Attwood's ftp site at in /pub/packages/dcurtis.

The method can be applied without modification to multipoint data - again pedigree likelihoods are calculated at a position linked to the markers, which may flank the test position, and at an unlinked position. A model-free lod score based on the single parameter alpha is obtained.

When dealing with multipoint data one will wish to test at least one position in each interval. With two-point data it will normally be sufficient to test a single position fairly close to the marker locus, although this may vary according to circumstances. When dealing with large pedigrees one may wish to test more than one position, because in large pedigrees recombination fraction is less completely confounded with the admixture parameter.


The procedure has been tested on simulated data produced by a wide variety of dominant, additive and recessive models having high and low penetrances and phenocopy probabilities. Two kinds of pedigree structure were used: affected sib pairs and pedigrees consisting of at least an affected sib pair and an affected first cousin, with the option for the simulation to make additional members affected. Two- and three-point data were simulated under conditions of non-linkage, loose linkage, tight linkage and (for the three-point data) with the disease locus between the marker loci. The new method was compared with sib pair analysis and with classical lod score analysis in which the correct transmission model was specified. None of the methods produced false positive results above chance expectation. All three methods gave fairly similar performance when applied to the affected sib pairs. When applied to the pedigrees, the classical lod score method sometimes gave results which were close to those from the sib pair method, but often performed much better. The new method sometimes performed in a fairly similar fashion to sib pair analysis and (when the loci were truly linked) sometimes produced a much higher lod score, at times coming close to that obtained by the lod score analysis with the correct mode of transmission specified.


The pedigree data set used was comparable in size to the sib pair data set but provided considerably more power to detect linkage for many models. However, only the classical lod score method and the new "model-free" method were able to utilise this additional information. The new method can be applied to any pedigrees and does not seem to produce artefactually negative results even if the mode of transmission is unknown, and even at positions close to, or between, markers. This means that pedigrees of all types may be recruited, that some of the statistical problems of nonparametric analyses can be avoided and that one may test candidate genes and positions in intervals between flanking markers.
