
The Phasing Problem
The human genome contains two copies of every gene…most of the time. Personal genetics products (23andMe, Navigenics, deCODEme) reflect this by reporting two alleles (each an A, T, C or G) at every SNP site, where each site is identified by an rs number (for example, rs1234). 23andMe customers who have downloaded their raw data file (around 15 megabytes of letters and numbers) will have seen this in the genotype column.
While the microarray platforms employed by direct-to-consumer genetics companies are capable of determining both variants (one on each chromosome) inherited by an individual at each locus, they are unable to assign each variant to a specific chromosome. Let me explain:
Let's assume that four different positions were analyzed on chromosome 1: positions 1-4. For each position, two alleles were determined by a microarray (the technology used by 23andMe and friends). The data would appear as follows:
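For readers who want to look at that genotype column programmatically, here is a minimal Python sketch. It assumes the raw data is tab-separated text with four columns (rsid, chromosome, position, genotype) and that comment lines start with "#". The function name parse_raw_genotypes and the rs numbers in the sample are made up for illustration, so check them against your own file before relying on this.

```python
import csv
import io

def parse_raw_genotypes(text):
    """Return {rsid: (chromosome, position, genotype)} from tab-separated text.

    Assumed layout: rsid, chromosome, position, genotype; '#' lines are comments.
    """
    # Drop blank lines and comment lines before handing rows to the CSV reader.
    rows = (line for line in io.StringIO(text)
            if line.strip() and not line.startswith("#"))
    calls = {}
    for rsid, chrom, pos, genotype in csv.reader(rows, delimiter="\t"):
        calls[rsid] = (chrom, int(pos), genotype)
    return calls

# Hypothetical sample data, not taken from a real file.
sample = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAG\n"
    "rs0000002\t1\t2000\tCT\n"
    "rs0000003\t1\t3000\tCC\n"
)
calls = parse_raw_genotypes(sample)
for rsid, (chrom, pos, genotype) in calls.items():
    print(rsid, chrom, pos, genotype)
```

Note that the two letters in the genotype column carry no order: "AG" only says that one A and one G were observed at that site.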
Position Genotype
1 AG
2 CT
3 CT
4 CC
The problem with this data is that we do not know which nucleotide (A,T,C,G) belongs to which chromosome. One possible arrangement is as follows:
Position ChromosomeA ChromosomeB
1 A G
2 T C
3 C T
4 C C
However, we do not know if this arrangement is correct since there is no information about whether or not certain alleles are linked together.
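To see how quickly the uncertainty grows, here is a small Python sketch that lists every pair of chromosomes consistent with the four genotypes above. The function name possible_phasings is my own, and this is brute-force enumeration for illustration only, not how real phasing software works.

```python
from itertools import product

def possible_phasings(genotypes):
    """Enumerate every distinct pair of haplotypes consistent with unphased genotypes."""
    # At a heterozygous site either allele could sit on the first chromosome;
    # at a homozygous site there is only one option.
    options = [
        [(g[0], g[1])] if g[0] == g[1] else [(g[0], g[1]), (g[1], g[0])]
        for g in genotypes
    ]
    seen = set()
    results = []
    for combo in product(*options):
        hap_a = "".join(first for first, _ in combo)
        hap_b = "".join(second for _, second in combo)
        # The labels "A" and "B" are arbitrary, so (x, y) and (y, x) are the same answer.
        key = frozenset((hap_a, hap_b))
        if key not in seen:
            seen.add(key)
            results.append((hap_a, hap_b))
    return results

genotypes = ["AG", "CT", "CT", "CC"]
phasings = possible_phasings(genotypes)
for hap_a, hap_b in phasings:
    print(hap_a, hap_b)
```

With three heterozygous positions there are four distinct arrangements, and the one shown in the table above (ATCC on one chromosome, GCTC on the other) is just one of them. Each additional heterozygous site doubles the count, which is why a chromosome with thousands of such sites cannot be resolved by guessing.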
The ability to distinguish which alleles belong to which chromosome is important when considering how genes are inherited. Generally, a parent passes one of the two copies of each chromosome on to their offspring. Although a process called recombination can shuffle segments between a parent's two copies before one is passed on, genes that sit on the same chromosome are typically “linked” and inherited together by the child.
To determine which genes of yours are linked together (and therefore likely to be inherited together by your child), it is first necessary to figure out which alleles (indicated by the variant SNPs) exist together on the same chromosome. This process has been termed “phasing” in the bioinformatics world.
Phasing: Simple in Theory, Complex in Practice
I just lied: phasing is not a simple process, not even in theory. It has its roots in computer science and statistics, and it is often accomplished by employing Markov chain Monte Carlo algorithms, which I won’t even attempt to explain here (people with many more letters after their name give courses on this). Phasing is also very computationally intensive, and it takes a long time, even with our super-duper, post-millennial computers. Current phasing protocols have been developed for the analysis of large population datasets. This is troublesome for the individual consumer, since researchers have very little reason to tailor phasing protocols to the needs of one person. However, the ability to phase your own genome would provide you with valuable information about which of your genes are likely to be inherited together by your children.
To phase chromosomes, researchers often rely on family duo and trio data (data from a parent and a child, or both parents and a child) to help get more accurate results. However, phasing can also be accomplished by aggregating the genotype data from unrelated individuals (of the same, or similar, ethnicities for more accuracy). By phasing population data, researchers identify haplotypes, which are essentially sets of alleles that sit together on the same stretch of a chromosome and are often common to a particular ethnic group. A number of freely available (through academic license) programs exist to help you analyze population data to determine phase: PHASE, fastPHASE, and SHAPEIT all accomplish this task (though at varying speeds and with different levels of accuracy).
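To give a feel for why trio data helps, here is a rough Python sketch of the simplest rule-based approach: at each site, work out which of the child's alleles could only have come from the mother and which from the father. The function name phase_with_trio and the parental genotypes are hypothetical, and this is plain Mendelian bookkeeping rather than what PHASE, fastPHASE, or SHAPEIT actually do.

```python
def phase_with_trio(child, mother, father):
    """Return (maternal_allele, paternal_allele) for one site, or None if unresolved."""
    a, b = child
    if a == b:
        # Homozygous child: both chromosomes carry the same allele.
        # (This sketch does not check the parents for consistency here.)
        return (a, b)
    # Try both assignments and keep those compatible with the parents' genotypes.
    fits = [(m, p) for m, p in ((a, b), (b, a)) if m in mother and p in father]
    # Exactly one fit means the site is phased; zero or two means we cannot tell.
    return fits[0] if len(fits) == 1 else None

# Child genotypes from the example above; parental genotypes are invented.
child = ["AG", "CT", "CT", "CC"]
mother = ["AA", "CT", "CC", "CC"]
father = ["GG", "CT", "TT", "CC"]

phased = [phase_with_trio(c, m, f) for c, m, f in zip(child, mother, father)]
for position, result in enumerate(phased, start=1):
    print(position, result)
```

Position 2 comes back as None because the child and both parents are all heterozygous (CT), so either parent could have supplied either allele. Sites like that one are where the statistical, population-based methods earn their keep.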
I am currently interested in “Personal Genome Phasing” as a means of modeling inheritance across generations. Not immediately, but in the not too distant future, I expect to have a post on how one might accomplish the task of phasing your own genome (this will likely require you to have information from your parents/children/siblings as well as your own). After I have properly demonstrated my preferred phasing method, I will move on to SNP imputation!