For those interested in clinical diagnostics of genetic diseases, the molecular information in our various genomic databases is harder to use than it should be. If you try to locate the common p.Q12X mutation in the NCBI Reference Sequence of the AMPD1 gene (the mutation causes Adenosine Monophosphate Deaminase Deficiency), you will find that the 12th codon does not encode Glutamine (according to the current RefSeq record, NM_000036.2, the mutation is actually p.Q45X). For researchers and clinicians looking for SNP primers or probes for specific mutations, the lack of fidelity and congruency between reported mutations and the “curated” databases is more than a headache. I have personally spent hours chasing down one mutation, only to remain uncertain whether I had located the proper nucleotide in the end.
What is the issue? Did the original researchers not properly locate the mutation? Is the Reference Sequence incorrect? What can be done to increase the ease with which we can not only locate a specific phenotype-associated variant, but determine the nucleotides immediately surrounding that locus as well?
Understanding the Genomic Databases
The National Center for Biotechnology Information (NCBI) provides an excellent summary of the differences between its two major databases: RefSeq and GenBank. The major distinction for those interested in clinical genetics is that RefSeq is curated and GenBank is not. GenBank is essentially a public repository of all DNA sequences made available by researchers, who are responsible for updating and maintaining their own submitted sequences. RefSeq is the database that attempts to provide one sequence as the “Reference Sequence,” usually determined by looking at the most common variants among the GenBank entries. RefSeq sequences are occasionally updated, which is why it is always important to include the accession number (with its version suffix, as in NM_000036.2) when using the Reference Sequence.
RefSeq has greater utility because it provides linked records between the genomic DNA, the mRNA transcript (and cDNA record), and the translated protein. It would seem that locating a non-synonymous mutation should take just a few clicks.
An Arcane Nomenclature System
Okay, I admit, arcane is a bit harsh, but it is time to update our nomenclature system. The Human Genome Variation Society (HGVS) provides the guidelines that are currently considered the best practice in naming sequence variants. They rely on coding DNA (cDNA) and essentially number the cDNA beginning with 1 for the first Adenine in the initiation (ATG) codon. There are obvious drawbacks to using cDNA, the most important being lack of numerical assignments for intronic nucleotides.
More and more intronic mutations are being found to account for deleterious phenotypes. Yet the original (deprecated) system for naming intron variants looks something like this: IVS3+22G>A (interpretation: the G 22 base pairs from the beginning of intron 3 is changed to an A). There always seems to be some ambiguity when I come across mutations described this way: we can also write IVS3-98G>A (the G 98 base pairs from the end of intron 3 is changed to an A).
More recently, it has been recommended that intronic mutations be described relative to the closest cDNA position. Our mutation above might now be described as c.195+22G>A, where cDNA nucleotide 195 is the last nucleotide of exon 3 (which immediately precedes intron 3). Or, we might write it as c.196-98G>A.
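To make the structure of these names concrete, here is a minimal Python sketch that splits a cDNA-anchored substitution into its parts. The names `CdnaVariant` and `parse_cdna_substitution` are hypothetical, and the pattern only covers simple single-base substitutions in the form shown above; it is not a full HGVS parser.

```python
import re
from typing import NamedTuple


class CdnaVariant(NamedTuple):
    """Hypothetical container for a parsed cDNA-anchored substitution."""
    anchor: int  # nearest coding nucleotide (c. numbering)
    offset: int  # signed distance into the intron; 0 for exonic variants
    ref: str     # reference base
    alt: str     # variant base


# Assumption: only simple substitutions such as c.195+22G>A or c.1138G>A.
_PATTERN = re.compile(r"^c\.(\d+)([+-]\d+)?([ACGT])>([ACGT])$")


def parse_cdna_substitution(text: str) -> CdnaVariant:
    """Split a description like 'c.195+22G>A' into anchor, offset, ref, alt."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unsupported variant description: {text!r}")
    anchor, offset, ref, alt = match.groups()
    # int() accepts a leading sign, so '+22' -> 22 and '-98' -> -98.
    return CdnaVariant(int(anchor), int(offset or 0), ref, alt)
```

Parsing c.195+22G>A this way gives an anchor of 195 and an offset of +22, while c.196-98G>A gives an anchor of 196 and an offset of -98. Neither number says anything about where the base sits in the genomic record, which is the problem discussed next.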
Let’s be practical for a second. This is terrible indexing. If I want to find an intronic mutation, it will be indexed according to the cDNA record of the relevant gene. Fine, but if I search the cDNA for nucleotide 195, the intron has been spliced out of the transcript, so I cannot locate the sequence of interest there. Instead, I have to take the last few base pairs of exon 3 from the cDNA record, open up the genomic DNA (gDNA) record, and search for these base pairs. Then, once I have discovered where this exon ends in the genomic DNA, I must count 22 base pairs forward to locate the nucleotide of interest (this will all be in vain if the mutation was reported incorrectly in the literature).
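If the exon boundaries are already known, the manual search described above reduces to a little arithmetic. The following Python sketch assumes a forward-strand gene whose exons are given as 1-based, inclusive genomic coordinates. The function name `coding_to_genomic` and the coordinates in the usage note are invented for illustration and do not describe any real gene.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # 1-based, inclusive genomic coordinates, forward strand


def coding_to_genomic(exons: List[Exon], atg_pos: int,
                      c_pos: int, offset: int = 0) -> int:
    """Return the genomic coordinate of c.<c_pos><offset>.

    Hypothetical helper: handles only positions at or after the ATG and
    does not treat the stop codon or 3' UTR specially.
    """
    if c_pos < 1:
        raise ValueError("only positions at or after the ATG are handled")
    remaining = c_pos
    for start, end in exons:
        if end < atg_pos:
            continue  # exon lies entirely upstream of the ATG
        first = max(start, atg_pos)
        length = end - first + 1
        if remaining <= length:
            g_pos = first + remaining - 1
            # An intronic offset only makes sense at an exon boundary.
            if offset > 0 and g_pos != end:
                raise ValueError("'+' offset must follow the last base of an exon")
            if offset < 0 and g_pos != start:
                raise ValueError("'-' offset must precede the first base of an exon")
            return g_pos + offset
        remaining -= length
    raise ValueError("position lies beyond the last exon")
```

With a toy exon table of `[(101, 160), (301, 360), (501, 575), (695, 800)]` and the ATG at genomic position 101, c.195 is the last base of exon 3 (position 575), so `coding_to_genomic(exons, 101, 195, 22)` returns 597. In this toy gene intron 3 is 119 bases long, so `coding_to_genomic(exons, 101, 196, -98)` lands on the same base.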
The lesson here: although the RefSeq genomic DNA is linked to the cDNA and the protein records, the actual nucleotides/codons/amino acids are not indexed together.
A Standardized Index: Making Sure Mutation Nomenclature is Static
The cDNA naming convention grew out of the fact that, for most of the 20th century, the ability to completely sequence genomic DNA was very limited. It made sense for mutations to be reported based on the cDNA because it reflected the two more easily obtainable (and more static) records of that time: the mRNA transcript and the protein sequence. However, now that the Human Genome Project has paved the way for easier sequencing, and we are constantly improving a standard Reference Sequence, it makes much more sense to name mutations in terms of their genomic DNA (gDNA) location.
As reported by Dalgleish et al. (2010), the Locus Reference Genomic (LRG) sequence format is being developed specifically for the purpose of accurately indexing genomic variants. The LRG will provide a static record for every gene of interest, and this record WILL NOT CHANGE. It may be annotated, etc., but in the interest of preserving reported mutations, the backbone sequence (and therefore the base pair numbering) will not change.
I have been using the Reference Sequence in a similar way whenever I locate a mutation for testing purposes. For example, I was searching for the p.G380R mutation in the FGFR3 gene, a common cause of Achondroplasia. The cDNA variant I was interested in was c.G1138A. After searching for the mutation within the gene transcript, I located the surrounding base pairs and determined that the genomic name for the mutation is g.G10458A. In my system, I start with the Adenine of the initiation codon as 1 for the gDNA as well as the cDNA. Thus, the gDNA and cDNA index numbers are identical through the first coding exon and diverge once the first intron begins. For point mutations, I have written a script that accomplishes this automatically by interacting with the UCSC genome browser. This shared starting index is what makes the script work.
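The divergence between the two indices can be sketched in a few lines of Python. This is not the script mentioned above and it does not contact the UCSC genome browser. It is a toy illustration that assumes the lengths of the coding exon segments and the introns between them are already known, and the function name `c_to_atg_relative_g` is hypothetical.

```python
from typing import Sequence


def c_to_atg_relative_g(c_pos: int,
                        coding_exon_lengths: Sequence[int],
                        intron_lengths: Sequence[int]) -> int:
    """Convert c.<c_pos> to a genomic index that also counts the A of ATG as 1.

    coding_exon_lengths: coding bases contributed by each exon, in order,
                         starting with the exon that contains the ATG.
    intron_lengths:      length of the intron following each of those exons
                         (one fewer entry than coding_exon_lengths).
    """
    if len(intron_lengths) != len(coding_exon_lengths) - 1:
        raise ValueError("expected exactly one intron between each pair of exons")
    if c_pos < 1:
        raise ValueError("only positions at or after the ATG are handled")
    consumed = 0  # coding bases covered by the exons already passed
    shift = 0     # total intronic bases skipped so far
    for i, exon_len in enumerate(coding_exon_lengths):
        if c_pos <= consumed + exon_len:
            return c_pos + shift
        consumed += exon_len
        if i < len(intron_lengths):
            shift += intron_lengths[i]
    raise ValueError("c_pos lies beyond the coding sequence")
```

For a made-up gene with coding exon lengths `[60, 60, 75, 106]` and intron lengths `[140, 140, 119]`, c.50 is also g.50 because it lies in the first coding exon, c.61 becomes g.201 once the first intron is counted, and c.196 becomes g.595. These numbers are invented; the FGFR3 coordinates quoted above are not reproduced here.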
Developing a Functional, Interactive Reference Assembly
Although the data files are linked between RefSeq gDNA, cDNA and amino acid sequences, the records are not indexed against one another thoroughly enough to be truly useful. Ultimately, I would like to be able to browse a cDNA/gDNA sequence, perform a base-pair change, determine if this change is synonymous or non-synonymous, and find out if the change has been associated with any disease phenotypes (which would entail linking individual nucleotides to OMIM records). The development of such a functional variation browser will require a lot of forethought, smart programming, and a great deal of curation. However, I believe that development of the Locus Reference Genomic (LRG) sequence is a step in the right direction.
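The synonymous versus non-synonymous step of such a browser is the easiest piece to sketch. The Python below is a minimal illustration that assumes the coding sequence is available as a plain string and uses the standard genetic code. The function name `classify_substitution` and the nine-base sequence in the usage note are made up, and the disease-association lookup is not attempted.

```python
from itertools import product
from typing import Tuple

# Standard genetic code, listed in TCAG order; '*' marks a stop codon.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def classify_substitution(cds: str, c_pos: int, alt: str) -> Tuple[str, int, str, str]:
    """Return (ref_aa, codon_number, alt_aa, kind) for a single-base change.

    cds   : coding sequence starting at the A of ATG (hypothetical input)
    c_pos : 1-based position within the coding sequence
    alt   : the substituted base
    """
    cds = cds.upper()
    idx = c_pos - 1
    if not 0 <= idx < len(cds):
        raise ValueError("c_pos lies outside the coding sequence")
    codon_start = idx - idx % 3
    ref_codon = cds[codon_start:codon_start + 3]
    if len(ref_codon) < 3:
        raise ValueError("coding sequence ends with an incomplete codon")
    within = idx % 3
    alt_codon = ref_codon[:within] + alt.upper() + ref_codon[within + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "nonsense"
    else:
        kind = "missense"
    return ref_aa, codon_start // 3 + 1, alt_aa, kind
```

On the toy sequence `ATGCAGGGG`, changing position 4 to T turns codon 2 from CAG (Gln) into the stop codon TAG, changing position 6 to A leaves Gln in place (synonymous), and changing position 7 to A turns GGG (Gly) into AGG (Arg).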
Dalgleish R, Flicek P, Cunningham F, Astashyn A, Tully RE, Proctor G, Chen Y, McLaren WM, Larsson P, Vaughan BW, Béroud C, Dobson G, Lehväslaiho H, Taschner PE, den Dunnen JT, Devereau A, Birney E, Brookes AJ, & Maglott DR (2010). Locus Reference Genomic sequences: an improved basis for describing human DNA variants. Genome Medicine, 2(4). PMID: 20398331