Fundamental concepts in genetics

 

Key objectives

 

Explanation of genetic terms

 

Chromosomes

 

A chromosome contains two complementary strands of deoxyribonucleic acid (DNA). These are long polymers of nucleotides, each consisting of phosphate, deoxyribose and one of four "bases": adenine, thymine, guanine and cytosine (A, T, G and C). The bases always form pairs held together by hydrogen bonds between complementary bases: A-T or C-G. (The two strands are termed anti-parallel, in that they "run" in opposite directions.) (RNA differs from DNA in that the sugar is ribose rather than deoxyribose, and the base uracil (U) is used instead of thymine.)
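The pairing rules can be illustrated with a short Python sketch. The function names are invented for this illustration, and the example sequence is arbitrary.

```python
# Watson-Crick pairing rules from the text: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, written in its own reading direction."""
    # Reversing the order reflects the anti-parallel orientation of the strands.
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


def to_rna(dna: str) -> str:
    """Write a DNA sequence in the RNA alphabet (U in place of T)."""
    return dna.upper().replace("T", "U")


if __name__ == "__main__":
    print(reverse_complement("ATGCCG"))
    print(to_rna("ATGCCG"))
```

Taking the reverse complement twice returns the original sequence, which is a simple check that the two strands carry the same information.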

 

Meiosis

 

This is the process whereby gametes are produced for sexual reproduction. Normal cells have pairs of homologous chromosomes - 22 pairs of autosomal chromosomes and two sex chromosomes (XX or XY). (For each pair, one chromosome will have been inherited from the father and one from the mother.) Cells containing pairs of chromosomes are termed "diploid". The process of meiosis results in gametes being formed which are haploid - they only contain one member of each pair. As shown in the diagram, during meiosis some cross-overs occur between homologous chromosomes so that each chromosome found in the gamete consists of a "mixture" of material from both members of the pair.

 

 

 

(In males, only a limited amount of such exchange of DNA occurs between X and Y chromosomes - this is at the pseudoautosomal region which is at the end of the short arm of the chromosome. There is always one crossover in this region, so that of the four gametes formed in meiosis two will contain sex chromosomes in which a crossover has occurred.)  

 

Mitosis

 

This term refers to the asexual reproduction of a somatic cell. It results in two diploid cells having the same chromosomal makeup as the original cell.

 

Genes

 

In physical terms, genes are the parts of chromosomes containing stretches of DNA which actually code for proteins. However the concept of a gene predates the understanding of DNA and the physical basis of the inheritance of characteristics. Mendel showed that it was possible to make some observations about the nature of the mode of transmission of information from parents to children. A gene could be thought of as a discrete unit of information influencing inherited characteristics.  

 

In some contexts the terms "genetic" and "physical" are contrasted. "Physical" refers to an understanding of what is occurring at the level of the DNA molecule, whereas "genetic" refers to an understanding based on patterns of inheritance. This is in some ways analogous to the relationship between anatomy and physiology, structure and function. Since genetic events depend on physical ones there should obviously be a close relationship between the two. We can speak of genetic and physical maps, or another example is that in linkage analysis the genetic phenomenon of recombination is based on the physical phenomenon of cross-over during meiosis.  

 

 

 

As shown in the diagram, the length of DNA forming a gene is transcribed to messenger RNA (mRNA) and this is translated into a polypeptide, which after further modification will result in the formation of a protein. Three bases (nucleotides) code for each amino acid, and there is some redundancy in this coding in that more than one triplet (or "codon") can code for the same amino acid. Most genes contain some regions which are transcribed but then removed from the transcript, so that they do not code for protein. These are called introns, and the stretches of DNA which they lie between are called exons and form the coding regions of the gene. The lengths of mRNA transcribed from the exons are spliced together after transcription to form one strand of mRNA.
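As a minimal sketch of triplet coding and its redundancy, the following Python fragment translates a short mRNA. The codon table is deliberately partial (only the codons needed here), and the example sequences are made up for illustration.

```python
# Partial codon table: an assumption made to keep the example short.
# The full genetic code has 64 codons.
CODON_TABLE = {
    "AUG": "Met",
    "UUU": "Phe", "UUC": "Phe",
    "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def translate(mrna: str) -> list[str]:
    """Read an mRNA three bases at a time until a stop codon is reached."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "Stop":
            break
        peptide.append(amino_acid)
    return peptide


if __name__ == "__main__":
    # Two different mRNA sequences that code for the same polypeptide.
    print(translate("AUGUUUGGAUAA"))
    print(translate("AUGUUCGGGUAG"))
```

The two sequences differ at three positions yet give the same amino acids, which is the redundancy described above.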

 

Normal human nuclei contain 22 pairs of autosomes and a pair of sex chromosomes (XX or XY). The human genetic sequence consists of approximately 3,000,000,000 base pairs coding for approximately 20,000 genes.

 

Genetic polymorphism

 

Within a species, one chromosome of a given type is almost identical to another, but at some places (loci) on the chromosome there may be some variability in the DNA sequence between chromosomes. This variation in sequence is termed allelic polymorphism. Alleles are detectable variations occurring at a single genetic locus, with the possibility that individual chromosomes may differ in the allele each has at a given locus. Alleles are detected using genetic markers. Where allelic variation is frequently found (say that at least 10% of chromosomes have an allele other than the most commonly occurring one) one refers to "a polymorphism". If variation is rare one is more likely to speak of "a mutation" (especially if it results in disease) or "a variant". (Note that the term mutation is also used to describe the process whereby new modifications to DNA sequences occasionally occur, and in evolutionary terms it was these mutation events which resulted in the genetic polymorphism we observe today.)  

 

The term genetic marker can be used very broadly to apply to any observable variation which results from a variation at a single genetic locus. This may be a clinical trait such as colour-blindness or G6PD activity, a serological marker such as ABO blood group, or a DNA marker whereby molecular genetic techniques are used to identify variations in DNA sequences directly. (The DNA markers are sometimes contrasted to the "classical" genetic markers which until a few years ago were the only ones available for use.)  

 

The term phenotype refers to characteristics which can be observed, whereas the term genotype refers to the underlying alleles at a locus which determine or influence phenotype. There may not be a one-to-one correspondence between phenotype and genotype. For example, if we denote the alleles of the ABO blood group as a, b and o then the following correspondence exists between underlying genotype and observed phenotype (in this case blood groups):

 

ab => AB  

aa, ao => A  

bb, bo => B  

oo => O  
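The correspondence above can be written as a small Python function. This is only a sketch of the table; the function name is invented for the example.

```python
def abo_phenotype(genotype: str) -> str:
    """Map an ABO genotype such as 'ao' to the observed blood group."""
    alleles = set(genotype.lower())
    if len(genotype) != 2 or not alleles <= {"a", "b", "o"}:
        raise ValueError(f"not a valid ABO genotype: {genotype!r}")
    if alleles == {"a", "b"}:
        return "AB"  # both alleles are expressed
    if "a" in alleles:
        return "A"   # aa or ao: o is masked by a
    if "b" in alleles:
        return "B"   # bb or bo: o is masked by b
    return "O"       # only oo gives group O


if __name__ == "__main__":
    for g in ("ab", "aa", "ao", "bb", "bo", "oo"):
        print(g, "=>", abo_phenotype(g))
```

The function maps six genotypes onto four phenotypes, so it cannot be inverted: a subject with blood group A may have genotype aa or ao.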

 

For DNA markers it is generally possible to determine genotype directly from phenotype (i.e. the laboratory observations) so the distinction becomes less relevant. Such markers are termed codominant, meaning that characteristics of both alleles are expressed (i.e. detectable).  

 

For autosomal chromosomes (i.e. not the sex chromosomes), and for the X chromosome in females, there are two copies of each chromosome and hence two alleles of each marker. If these alleles are the same one says that the subject is homozygous for this marker; if they are different the subject is heterozygous (or "a heterozygote").

 

If a marker has a large number of different alleles then it is termed highly polymorphic and in genetic terms is spoken of as highly informative. A measure of the informativeness of a marker is the PIC value (polymorphism information content) - the probability of a meiosis being informative for linkage, or the heterozygosity - the probability of an individual being heterozygous.  
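Both measures can be computed from allele frequencies. The sketch below assumes random mating, so that genotype frequencies follow from allele frequencies, and uses the PIC formula usually attributed to Botstein and colleagues; neither assumption is stated in the text above, and the function names are invented for the example.

```python
from itertools import combinations


def heterozygosity(freqs):
    """Probability that an individual carries two different alleles.

    Assumes random mating, so a homozygote for allele i has frequency p_i^2.
    """
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """Polymorphism information content for the given allele frequencies.

    PIC = 1 - sum(p_i^2) - sum over pairs i < j of 2 * p_i^2 * p_j^2
    """
    homozygosity = sum(p * p for p in freqs)
    uninformative = sum(2 * (p * q) ** 2 for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - uninformative


if __name__ == "__main__":
    biallelic = [0.5, 0.5]
    four_alleles = [0.25, 0.25, 0.25, 0.25]
    for name, freqs in (("2 alleles", biallelic), ("4 alleles", four_alleles)):
        print(name, round(heterozygosity(freqs), 3), round(pic(freqs), 3))
```

With equally frequent alleles, both measures rise as the number of alleles increases, which is why a marker with many alleles is described as more informative than a biallelic one.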

 

Note that in physical (i.e. molecular) terms a genetic locus is quite a big thing - for example a locus may consist of a gene containing several thousand bases. This means that there may be more than one marker at the same genetic locus, and so strictly speaking one should talk about the alleles of a marker rather than of a locus.  

 

DNA markers

 

The four main types of DNA marker are restriction fragment length polymorphisms (RFLPs), variable number of tandem repeat polymorphisms (VNTRs), microsatellite polymorphisms based on di-, tri- or tetra-nucleotide repeats and single nucleotide polymorphisms (SNPs).

 

RFLPs were developed first. The polymorphism consists of the presence or absence of a restriction site for a bacterial restriction enzyme. This is an enzyme which breaks strands of DNA wherever they contain a certain sequence of half-a-dozen or so nucleotides. (Different enzymes recognise different restriction sites.) The locus of interest could be "probed" using a radiolabelled piece of DNA with the same sequence as part of the test locus. This would selectively hybridise to the restriction fragment derived from the test locus. The whole process (see diagram) consisted of:  

 

(This process was termed Southern blotting, after Dr Southern.)  

 

If a restriction site adjacent to the test locus is sometimes present and sometimes absent, then the fragment in which the test locus is found will vary in size and form a polymorphic genetic marker. Whether or not a restriction site is recognised by the enzyme can depend on a single base pair change in the DNA sequence. These restriction site polymorphisms occur at many sites compared to the limited number of classical genetic markers which were available. They often occur in non-coding regions ("junk" DNA) so that the variation has no biological effect, but acts only as a marker. RFLPs ushered in the era of the "new genetics".  

 

 

 

Nowadays RFLPs can be detected by other, more efficient methods, typically consisting of PCR amplification (see below) followed by digestion with the restriction enzyme and visualisation of fluorescently labelled fragments following electrophoresis.  

Another DNA polymorphism which can be used as a genetic marker consists of a variable number of tandem repeats (VNTR). In non-coding regions there may be certain sequences of DNA which are repeated several times, and the number of times such a sequence is repeated may vary. This will produce variation in the size of the restriction fragment containing these repeats, which again could be detected by Southern blotting.  

 

A DNA polymorphism which is now much more widely used is the microsatellite repeat. This consists of a variable number of repetitions of a very small number of base pairs (di-, tri- or tetra-nucleotide repeats) often consisting of cytosine and adenine (CA-repeats). These repeat sequence polymorphisms are detected by the polymerase chain reaction (PCR). PCR can be used to produce large numbers of copies of a specified small region of DNA containing the test locus - a process called "amplification". The specificity of the reaction is defined by a pair of oligonucleotide primers, each about twenty bases long, which match the sequence at either end of the region to be amplified (which would contain the microsatellite repeat). As shown in the diagram, repeated cycles of denaturation, annealing and elongation result in an exponential increase in the number of copies of the region amplified (indeed it is even possible to amplify a region from one single molecule of DNA). The new copies can be radioactively or fluorescently labelled during the PCR process, and they are so small that when they are size-separated using electrophoresis the size differences due to variations in the number of repeats contained in the fragments are readily detectable. These microsatellite polymorphisms have become very popular because: tiny quantities of DNA are used; alleles can be read very reliably; there are very large numbers of polymorphic loci; often there are large numbers of alleles, so that the markers are highly informative.
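The exponential increase can be made concrete with a short calculation. This is an idealised sketch: the efficiency parameter and the function name are assumptions introduced for the example, and real reactions plateau as reagents are used up.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Expected number of copies after a given number of PCR cycles.

    efficiency is the fraction of templates copied in each cycle;
    1.0 means perfect doubling, which is an idealisation.
    """
    return initial_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    for cycles in (10, 20, 30):
        print(cycles, "cycles:", f"{pcr_copies(1, cycles):,.0f}", "copies")
```

Starting from a single molecule, thirty cycles of perfect doubling would give about a thousand million copies, which is why such small starting quantities of DNA are sufficient.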

 

 

 

Recently, interest is focussing on the possibilities for using SNPs (single nucleotide polymorphisms) especially in association studies. These consist of changes in a single base pair at a particular point. The change is either present or absent, so the markers are biallelic and hence not very informative. However they are extremely numerous, being densely present throughout the genome, and so may offer more potential for fine-mapping disease genes than microsatellite markers. They are generally detected by different PCR-based methods.  

 

It is now routine practice to determine the actual sequence of DNA which a subject possesses at a particular locus. This involves PCR amplification in the presence of labelled nucleotides which are visualised as electrophoretic bands, allowing the base sequence to be read directly. However this process is expensive and time-consuming compared to genotyping standard markers or testing for known mutations.  

 

Genetic models of disease

 

A number of terms are used to describe how the susceptibility to a disease (phenotype) is related to underlying genotype. If the main effect of a disease locus depends on having only one copy of the abnormal allele, the disease is dominant. If two alleles are required (i.e. the subject must be homozygous for the disease allele) the disease is recessive. If having one disease allele increases the susceptibility, but having two alleles increases it further, then we can speak of a "partially dominant" effect. Another way of putting this is that degree of dominance tells us how closely the phenotype of heterozygotes resembles the phenotype of homozygotes for the abnormal allele - this conception may be especially applicable to loci influencing quantitative traits.  

 

Penetrance is defined as the probability of observing a particular phenotype (often affection with a particular diagnosis) conditional on having a particular genotype. If possessing one copy of the abnormal allele (or in the case of a recessive disease, two copies) inevitably leads to development of the disease then we describe the disease as fully penetrant. If possessing the abnormal genotype does not necessarily result in the abnormal phenotype, then we refer to incomplete or partial penetrance. A subject who has an abnormal genotype but has a normal phenotype and who passes the disease allele on to their offspring may be referred to as an asymptomatic carrier. A subject with a normal genotype who has the phenotype usually associated with the abnormal genotype can be termed "a phenocopy".  

 

It may be that more than one kind of abnormal allele at a given genetic locus can produce the same disease phenotype, perhaps with some variations in severity or other clinical measures. This is referred to as "allelic heterogeneity". More than one genetic locus may influence the susceptibility to a disease. If a disease may result from abnormal genotypes at any one of a number of loci (for example these might code for different proteins involved at different stages of the same metabolic pathway) then we speak of "locus heterogeneity", or sometimes "non-allelic heterogeneity". Alternatively the phenotype may be simultaneously influenced by the genotypes at more than one locus. If only a few loci are involved we speak of oligogenic effects. There may be an additive effect between the loci, or one locus may modify the effect of another (epistasis). If more than a few loci are involved then we speak of "polygenic" effects, and these are generally assumed to be additive, with each locus making a small contribution to the overall phenotype. The term "single major locus" is used to refer to a locus where the genotype has a relatively large influence over the phenotype, even though there may be background variance due to the action of other genes and/or environmental factors. Only loci with a major effect can be detected by linkage analysis.  

 

Sometimes the term "Mendelian" is used to describe traits which have a simple pattern of inheritance which follows the rules set out by Mendel. These traits are determined by just one genetic locus, with complete penetrance and no phenocopies. If more than one locus can be involved, if penetrance is incomplete, or if phenocopies can occur then the trait is said to have complex inheritance.  

 

Methods of investigation of genetic traits

 

Family studies

 

These look at the risk of different diagnoses (phenotypes) in the relatives of affected subjects (probands). They may provide some information about the genetic relationship between different diagnoses, but do not reliably distinguish between genetic and environmental transmission.  

 

Twin studies

 

Twin studies assume that monozygotic (identical) and dizygotic (fraternal) twins share environmental influences to the same extent. However, dizygotic twins share on average only half their genes with their cotwins, whereas monozygotic twins share all of them. Therefore if monozygotic twins are concordant for (i.e. are both affected by) a particular disease more often than dizygotic twins, this is said to imply that there is a genetic effect. This can be described quantitatively by the concordance ratio between monozygotic and dizygotic twins.

 

Adoption studies

 

Adoption studies seek to distinguish genetic from environmental effects by studying whether children more closely resemble their biological than adoptive parents. The assumption is that only genetic factors are transmitted from biological parents (though this may not be strictly true).  

 

Offspring of discordant monozygotic twins

 

An interesting extension of the twin method which has been applied to schizophrenia is to study the children of monozygotic twins who are discordant for disease (i.e. one twin is affected and the other is not). In such a case one would not expect familial environment to account for similarities between the two sets of children. If both sets of children have the same increased risk of affection this is suggestive of a large genetic contribution to aetiology.  

 

Segregation analysis

 

This seeks to determine the genetic mode of transmission of a disease by observing the pattern of transmission through families (note that twin and adoption studies do not provide this information). Is there evidence for a single major locus, and if so is it dominant or recessive, autosomal or X-linked, and is it fully or partially penetrant? The pattern of the disease within families is compared to that which would be expected under a variety of genetic models. Segregation analysis often does not provide definite results unless the mode of transmission is very obvious.  

 

Genetic marker studies

 

There are two kinds of genetic marker studies which aim to localise genes influencing susceptibility to an illness: association and linkage.  

 

Association studies

 

 

 

In association studies many unrelated affected individuals are studied and association is said to be present if a particular allele is present more often than in the general population or in a matched sample of unaffected controls. This may be because a genetic locus influencing affection lies very close to the marker. However false positive associations can occur if cases and controls are inadvertently drawn from different populations which have different frequencies for the marker alleles.
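One simple way to test for association at a biallelic marker is to compare allele counts in cases and controls with a chi-square test. The sketch below is only one of several possible tests; the counts are made up for illustration and the function names are invented for the example.

```python
import math


def allele_chi_square(case_counts, control_counts):
    """Pearson chi-square (1 degree of freedom) for a 2 x 2 table of allele counts.

    Each argument is a pair: (count of allele 1, count of allele 2).
    Every row and column total must be non-zero.
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def p_value_1df(chi_square):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2.0))


if __name__ == "__main__":
    # Hypothetical data: 100 cases and 100 controls, i.e. 200 alleles in each group.
    chi2 = allele_chi_square((60, 140), (40, 160))
    print(f"chi-square = {chi2:.2f}, p = {p_value_1df(chi2):.3f}")
```

A small p value from such a test does not by itself distinguish true association from the population stratification problem described above.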

 

Linkage studies

 

 

 

In linkage studies related individuals are studied, either siblings or extended pedigrees. Linkage is present when the alleles of a marker tend to cosegregate with a disease within a family (but different alleles of the marker may cosegregate with the disease in different families, see diagram).  

 

If we consider two different polymorphic markers, which may be linked, then we may be able to identify which alleles a subject has inherited from their father and which from their mother. The two paternal alleles can be referred to as a haplotype. If the two markers are at loci which are (in physical terms) close together on the same chromosome then (in genetic terms) they may demonstrate linkage. This means that haplotypes will tend to be passed on intact, and the children of the subject will inherit either the maternal haplotype or the paternal haplotype, rather than a new haplotype consisting of a mixture between paternal and maternal alleles. In physical terms this is because when two loci are close together on a chromosome it is unlikely that a crossover will occur between them at meiosis. If two alleles at different loci have been inherited together but are not passed on together to an offspring then this is termed recombination. If both loci are on the same chromosome then recombination occurs when there is an odd number of crossovers between them at meiosis. The proportion of meioses in which recombination occurs between two markers is called the recombination fraction. When two markers are at loci which are extremely close on the same chromosome then there will hardly ever be a crossover between them and the recombination fraction will approach zero - this situation can be referred to as "tight linkage". If the two loci are distant or are on different chromosomes then the alleles of one marker will be inherited at random with respect to the other and the recombination fraction will be 50%. Two loci which are linked have a recombination fraction of less than 50%.

 

The strength of evidence in favour of linkage is conventionally measured as the lod score. This is the logarithm base 10 of the ratio of two likelihoods - the likelihood assuming linkage and the likelihood assuming non-linkage. For a given set of genetic data, it is possible to calculate the likelihood of the observed data being produced given a set of assumptions about the parameters of the genetic model - the allele frequencies of the loci involved, relevant penetrance values, and the recombination fraction between loci. This recombination fraction is generally referred to as theta. The likelihood for the null hypothesis (non-linkage) is calculated with theta being set to 50%. Then the likelihood for an alternative hypothesis of linkage is calculated with theta set to a different value, less than 50%. The log of the ratio of these two likelihoods is the lod score (Log of ODds). If we denote the likelihood for a given set of data conditional on a certain value of theta as L(theta), then the formula for the lod score is:  

 

lod(theta) = log10[ L(theta) / L(0.5) ]

 

In practice a table of lod scores is calculated for different values of theta ranging from 0 to 50%. (At 50% the lod score is always 0, since it is the logarithm of a ratio equal to 1.) The maximum likelihood estimate of the true value of theta is the one which generates the highest lod score. If two loci are unlinked then they will often have negative lod scores, especially at small recombination fractions, so that it is possible to generate evidence against linkage - exclusion of linkage. Conventionally, lod scores less than -2 are taken as evidence against there being linkage at the relevant value of theta.  
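A table of lod scores can be sketched for the simplest possible case. The code below assumes that every meiosis can be scored unambiguously as recombinant or non-recombinant (phase known, full penetrance), so that the likelihood is theta^k * (1 - theta)^(n - k) for k recombinants in n meioses. Real pedigree data need dedicated linkage software; the function names and example counts are invented for illustration.

```python
import math


def lod_score(theta, recombinants, meioses):
    """lod(theta) = log10[L(theta) / L(0.5)] for phase-known meioses."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    if theta == 0.0 and recombinants > 0:
        return float("-inf")  # a recombinant is impossible when theta is 0
    log_l = (meioses - recombinants) * math.log10(1.0 - theta)
    if recombinants > 0:
        log_l += recombinants * math.log10(theta)
    return log_l - meioses * math.log10(0.5)


def lod_table(recombinants, meioses, step_percent=5):
    """Lod scores at theta = 0%, 5%, ..., 50% (the step is an arbitrary choice)."""
    return {pct / 100: lod_score(pct / 100, recombinants, meioses)
            for pct in range(0, 51, step_percent)}


if __name__ == "__main__":
    table = lod_table(recombinants=1, meioses=10)
    for theta, lod in table.items():
        print(f"theta = {theta:.2f}  lod = {lod:7.3f}")
    print("maximum lod at theta =", max(table, key=table.get))
```

In this simple case the lod score is highest where theta equals the observed proportion of recombinants, and ten meioses with no recombinants give a lod score just over 3 at theta = 0.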

 

Linked loci will usually produce positive lod scores. The traditional criterion for accepting evidence of linkage is that the maximum lod score should exceed 3, but there are very complex problems in interpreting the meaning of lod scores. The value of 3 implies that the ratio between the likelihoods for the alternative and null hypotheses should exceed 10 to the power of 3, or 1000. However this is not the same as saying that the significance level or "p value" is < 0.001. Although people often refer loosely to a lod of 3 as being "significant" the associated p value is not 0.05, in the sense that results such as this will be produced by chance on one occasion out of twenty. Because the lod score measures a likelihood ratio, it does not actually translate directly into a p value at all. The threshold of 3 was originally chosen because it was calculated that if such a threshold were used then roughly 5% of lod scores over 3 would be false positives - i.e. would have arisen by chance - whereas the other 95% would reflect true linkage (these calculations were probably flawed, but the convention remains). The actual significance of a lod score of over 3 (the probability of such a score occurring by chance) may be in the region of 0.0002. (It is worth bearing this figure in mind when assessing the weight of evidence in favour of linkage produced by other methods, such as sib pair studies, which do produce a p value rather than a lod score.)  

 

Of great importance in psychiatric genetics is that the weight accorded to a given lod score should be reduced when multiple genetic models are tested. Clearly, the more multiple testing is performed and the more "degrees of freedom" are entered into the model (for instance by varying penetrance parameters), the easier it will be to achieve a lod of 3 by chance and the weaker is the associated "significance" of the results. Because of these considerations, and because of a couple of spectacular false positives, lod scores of the order of 3-6 are taken nowadays more as suggestive rather than as providing definite evidence of linkage. The more flexibility there is in the genetic models used to produce the lod score, the more conservative should be the interpretation of the results.  

 

A difficulty with the application of the lod score method to psychiatric disorders (and other disorders with complex inheritance) is that all the parameters of the genetic model must be fully specified (i.e. gene frequency, penetrance, phenocopy rate, etc.), even though in reality these are not known. Misspecifying them may produce false negative results, while trying a range of different models produces problems of multiple-testing. Because of the requirement to specify these parameters the lod score method is described as "parametric", and to avoid this requirement non-parametric methods have been proposed. These consist generally of observing whether the affected members of a pedigree tend to inherit the same alleles of a marker. The most commonly used nonparametric methods examine sib pairs for increased allele-sharing, but newer methods also look at allele-sharing between other pairs of relatives. All these nonparametric methods are less powerful than the lod score method.  

 

Genetic distance

 

The recombination fraction between two loci depends on how far apart they are in physical terms along the DNA molecule. The exact relationship between recombination and physical distance varies between different regions of different chromosomes, and also between males and females. As a rough guide, 1% recombination is equivalent to a genetic distance of 1 centimorgan and a physical distance of 1 million base pairs (1 megabase). The relationship between recombination and distance does not remain linear, since when two loci are an infinite distance apart the recombination fraction is still only 50%. Recombination fraction is converted into distance using one of a variety of mapping functions, e.g. Kosambi, Haldane. Chromosomes vary greatly in length, but are of the order of magnitude of 100 centimorgans long.
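The two mapping functions named above can be written out directly. The formulas are the standard ones (Haldane assumes crossovers occur independently, Kosambi allows for interference); the function names are invented for the example, and distances are expressed in centimorgans.

```python
import math


def haldane_cm(theta):
    """Map distance in centimorgans from recombination fraction (Haldane)."""
    return -50.0 * math.log(1.0 - 2.0 * theta)


def kosambi_cm(theta):
    """Map distance in centimorgans from recombination fraction (Kosambi)."""
    return 25.0 * math.log((1.0 + 2.0 * theta) / (1.0 - 2.0 * theta))


def haldane_theta(cm):
    """Recombination fraction from map distance in centimorgans (Haldane)."""
    return 0.5 * (1.0 - math.exp(-2.0 * cm / 100.0))


def kosambi_theta(cm):
    """Recombination fraction from map distance in centimorgans (Kosambi)."""
    return 0.5 * math.tanh(2.0 * cm / 100.0)


if __name__ == "__main__":
    # theta must be below 0.5 here, because 50% corresponds to infinite distance.
    for theta in (0.01, 0.10, 0.30, 0.45):
        print(f"theta = {theta:.2f}  Haldane = {haldane_cm(theta):6.1f} cM"
              f"  Kosambi = {kosambi_cm(theta):6.1f} cM")
```

At small recombination fractions both functions give roughly 1 centimorgan per 1% recombination, and both approach a recombination fraction of 50% as distance increases.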

 

Differences between linkage and association

 

Notable differences between the two methods are:  

 

The transmission/disequilibrium test (TDT)

 

If there are hidden stratifications within the population association studies can give false positive results for markers which do not lie close to the disease gene. To guard against this happening a number of tests which use relatives as controls have been proposed, the most popular of which is the TDT.  

 

In the TDT a sample of cases is collected along with their parents, all of whom are genotyped for a marker. The heterozygous parents are examined. If the marker is unrelated to the disease, each heterozygous parent has a 50% chance of transmitting either allele of the marker to their affected offspring. However if one marker allele is associated with the disease then this allele will be transmitted on more than 50% of occasions.

 

 

 

The A allele is transmitted to affected offspring four times out of five.  
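The usual TDT statistic compares the number of times an allele is transmitted by heterozygous parents with the number of times it is not transmitted, and is referred to a chi-square distribution with 1 degree of freedom. The sketch below applies that standard form to the small example above (four transmissions out of five); the function names are invented for the illustration.

```python
import math


def tdt_statistic(transmitted, untransmitted):
    """TDT chi-square: (b - c)^2 / (b + c), counting heterozygous parents only."""
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (transmitted - untransmitted) ** 2 / total


def p_value_1df(chi_square):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2.0))


if __name__ == "__main__":
    chi2 = tdt_statistic(transmitted=4, untransmitted=1)
    print(f"chi-square = {chi2:.2f}, p = {p_value_1df(chi2):.2f}")
```

With only five informative transmissions the excess is far from significant. The chi-square approximation is also poor for such small counts, so an exact binomial test would be more appropriate in practice.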

 

Haplotype analysis

 

Each subject has two alleles of each marker, one on each chromosome of a homologous pair. If we consider a group of markers near to each other then all the alleles on one chromosome form one haplotype and the alleles on the other chromosome form another haplotype. One cannot tell by observing the genotypes how the alleles are grouped into haplotypes but there are techniques for estimating haplotypes. Theoretically, considering groups of markers together and their haplotypes may be more informative than considering each marker individually and multi-marker analyses may be better able to detect association with disease. On the other hand the methods are more complex and more prone to different kinds of error, yielding the risk of false positive results.

 

Copy number variants

 

Sometimes a stretch of one chromosome may be absent, termed a deletion, or may be repeated, termed a duplication. In the normal case, there are two copies of each sequence of genetic code, one on either chromosome of a homologous pair. However if there is a deletion then only one copy will be present and if there is a duplication three copies will be present. Studies may seek to determine if copy number variants are commoner in cases than controls.  

 

Genome-wide association studies

 

Because linkage can be detected over large distances, a genome scan for linkage can be carried out using approximately 400 microsatellite (multiallelic) markers. These might provide evidence for linkage with a localisation to within around 10-20 centimorgans (10-20 megabases).

 

Since around 2006, it has become possible to genotype large numbers (>500,000) of SNP (biallelic) markers to use for genome-wide association studies. Coverage from early marker sets is patchy and statistical problems arise from having to correct for multiple testing. The same marker sets can also be used for genome-wide linkage analysis.  
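The scale of the multiple testing problem can be seen from a simple Bonferroni correction, shown here as a sketch. Bonferroni is only one possible approach and is conservative when neighbouring markers are correlated; the function name is invented for the example.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the overall error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


if __name__ == "__main__":
    # 500,000 markers, each tested once, with an overall error rate of 5%.
    print(bonferroni_threshold(0.05, 500_000))
```

For 500,000 markers the per-marker threshold is of the order of one in ten million, so very large samples are needed to detect modest effects.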

 

Sequencing studies

 

The actual sequence of all DNA bases can be obtained by techniques which were previously extremely intensive and expensive.

 

Since around 2010 it has become cheaper and more practical to sequence the exome, comprising coding sequences of all genes, or the whole genome, which also includes non-coding DNA.  

 

Functional studies

 

Once sequence changes have been identified which appear to be associated with risk of a disease, functional studies can be carried out in cell systems or whole animals. These may measure the quantity of RNA transcribed or may investigate the functional properties of the protein product or the effect on the phenotype of the model organism.  

 

Updated January 2012  

 

http://www.davecurtis.net/dcurtis/lectures/introgen.htm  

 

Dave Curtis (d.curtis@ucl.ac.uk)