Gene Structure Data in ENSEMBL: Where Did All the Introns go?

I’m starting a new company. For now it is called The Our purpose: to develop a highly curated, open-source, easy-to-use Genome Browser that can better help make sense of the mountains of genomic data. We want to make it like GitHub (for those of you who are computer scientists), but for Genomics. Only this way can we truly create an open-source mechanism for attacking the genome.

In the process of coding out our prototype, we’ve discovered so many more issues with the way data is currently stored. You can see a post from a while back where I discuss the incorrect nomenclature of the many monogenic mutations in OMIM. In this current post, I’ll be talking about the issue of gene structures and the way they have been recorded in ENSEMBL.

What are Gene Structures?

On the most basic level, genes are sections of linked nitrogenous bases that are transcribed by RNA polymerase, transforming them from DNA to RNA. Protein-coding genes are identified by their Open Reading Frames (ORFs). Within a gene, certain sections end up in the final RNA molecule while certain sections do not. Classically, the sections that are kept are referred to as exons, and the sections that are transcribed but then spliced out are referred to as introns.

Once the introns have been spliced out, the RNA molecule is then translated into a chain of amino acids. This amino acid chain folds into the proteins that we all know and love. Enzymes, substrates, channels, cell structures, and many more proteins are created in this fashion. The creation process can be summarized as:

DNA (exons, introns) -> Transcription -> pre-mRNA (exons, introns) -> Splicing -> mRNA (exons only) -> Translation -> Protein

Congrats. You can now pass Biology 101.

Reality check: For most genes, the transcribed region contains one more structure besides the exons and the introns – the Untranslated Regions (UTRs). These are regions that are transcribed, but are not translated. See what I did there…I italicized two similar, but importantly distinct words so that we don’t miss this point.

How does ENSEMBL store this data?

Let’s take a gene as a case study: SOD1 (Superoxide Dismutase 1), located on chromosome 21. Here’s an overview of the gene (courtesy of The

In this image, the blue regions represent exons, the gray regions represent introns, and the red regions represent UTRs. In the ENSEMBL structure database, 5 regions are listed. They have grouped the first region (red + blue) into a single entry, as well as the last region (red + blue) into a single entry. They refer to all of them as exons. None of the introns are listed.

While it is very possible to figure out introns based on the location of the exons, this is a mess when dealing with the data. Instead of searching directly for an intron, we must first identify the surrounding exons, etc. The other main problem is that UTRs are not separated from exons! Elsewhere, ENSEMBL lists the positions of the UTRs, but what this means is that in the case of the start and ending exons for this gene, we have to reference many different pieces of data to determine the ultimate structure of the gene! It’s a HUGE headache.

How does The fix this problem?

At the genome, we’ve taken care to index every type of gene structure according to its absolute position on the chromosome. In the above picture of SOD1, we count 11 structures (not 5 like ENSEMBL): 2 UTRs, 5 Exons, and 4 Introns. We’ve taken care to do this properly. Introns are structures too.
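To make the counting concrete, here is a minimal sketch of how the "11 structures" view can be derived from an ENSEMBL-style exon list plus the start and end of the coding sequence. The coordinates below are made up for illustration (they are not the real SOD1 coordinates), the function name `split_gene` is hypothetical, and the sketch assumes a forward-strand gene with 1-based inclusive positions.

```python
from collections import Counter

def split_gene(exons, cds_start, cds_end):
    """Split ENSEMBL-style exons into UTR, coding exon, and intron regions.

    exons: list of (start, end) pairs, 1-based inclusive, forward strand.
    cds_start, cds_end: first and last coding positions.
    Returns a list of (kind, start, end) tuples in chromosomal order.
    """
    regions = []
    exons = sorted(exons)
    for i, (start, end) in enumerate(exons):
        # Anything in this exon before the coding sequence is 5' UTR.
        if start < cds_start:
            regions.append(("UTR", start, min(end, cds_start - 1)))
        # The overlap between the exon and the coding sequence.
        c_start, c_end = max(start, cds_start), min(end, cds_end)
        if c_start <= c_end:
            regions.append(("exon", c_start, c_end))
        # Anything in this exon after the coding sequence is 3' UTR.
        if end > cds_end:
            regions.append(("UTR", max(start, cds_end + 1), end))
        # The gap up to the next exon is an intron.
        if i + 1 < len(exons):
            regions.append(("intron", end + 1, exons[i + 1][0] - 1))
    return regions

# Hypothetical coordinates: 5 listed "exons", coding sequence from 80 to 2110.
EXONS = [(1, 150), (400, 500), (900, 970), (1300, 1420), (2000, 2600)]
regions = split_gene(EXONS, cds_start=80, cds_end=2110)
counts = Counter(kind for kind, _, _ in regions)
print(len(regions), dict(counts))
```

With five listed exons whose first and last entries each contain a UTR, the split yields 2 UTRs, 5 exons, and 4 introns, which is the same breakdown described above.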

It’s been exciting building a new genome browser. Anything that annoys us about the old browsers (or quality of data) is within our capacity to correct in a scalable way. If you have any suggestions of features or fixes you’d like added to The, please feel free to contact me! The is currently in private beta, but if you think your group/lab would be good test dummies for us, go ahead and request an invite.


The Consumer Genetics Conference of 2011

Finally, after two years of unavoidable conflicts (*cough* college reunions *cough*), I’m going to be attending the Consumer Genetics Conference up in Boston. Luminaries and superstars of the field like George Church, Jay Flatley, and Jonathan Rothberg will be speaking.

The amazing thing about this grassroots effort to bring about meaningful discussions about the translation of genetics to the patient/consumer is that it is only $20 (and eleven cents)! Register here.

The conference will be taking place from June 7th to the 9th at the Hynes Convention Center in Boston. A detailed schedule of events can be found here. Discourse of this nature is extremely important, and I’m excited to be attending!

Osama’s DNA: Testing and Its Limitations

May brought with it great news for the United States: a man who was responsible for the vile slaying of thousands of innocent Americans through the attack on the World Trade Center on September 11, 2001 was finally removed from this world by the United States. It was a glorious announcement for the many families who lost loved ones, an accomplishment for our executive branch, a message to purveyors of fear, and most importantly a victory for freedom.

Science Helped Out Too

As already reported, Federal Government officials confirmed Osama’s identity through forensic science analysis of his DNA. A sample from the man assassinated in Pakistan was matched against a subpoenaed sample of his sister, who had died of brain cancer. Christie Wilcox over at Scientific American reports, from a technical standpoint, about how this could have been accomplished so rapidly. Scientists played an important part in verifying and maintaining public confidence in Osama’s death.

Identity Testing vs. Siblingship Testing

One point I would like to make is that any test that is performed comparing Osama to his sister is not directly confirming Osama’s identity. It is confirming that the assassinated man is related to Osama’s sister. In order to directly test for identity, DNA would have to be matched against a previous sample from the same person.

Visually Sibling Matching vs. Identity Matching

The ideograms above represent two things. On the left, two siblings’ DNA are compared to one another. It is obvious that recombination consistent with siblingship has occurred. On the right, an individual’s DNA was compared to…his own DNA. There is complete identity across all 22 autosomes (I did not look at sex chromosomes with this program). There is a big difference between these two.
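For readers who prefer numbers to pictures, here is a rough sketch of the same comparison. It is a simplification: it counts alleles shared identical by state (IBS) at each marker instead of inferring shared segments the way the program behind the ideograms does. The genotypes are invented, coded as 0, 1, or 2 copies of an alternate allele, and `ibs_profile` is a hypothetical helper name.

```python
from collections import Counter

def ibs_profile(a, b):
    """Count markers at which two samples share 0, 1, or 2 alleles by state.

    a, b: equal-length lists of genotypes coded as 0, 1, or 2 copies
    of the alternate allele (unphased).
    """
    if len(a) != len(b):
        raise ValueError("both samples must cover the same markers")
    # Two genotypes differing by k alternate-allele copies share 2 - k alleles.
    return Counter(2 - abs(x - y) for x, y in zip(a, b))

# Invented genotypes at eight markers.
person = [0, 1, 2, 1, 0, 2, 1, 1]
sibling = [0, 2, 2, 1, 1, 0, 1, 0]

print(ibs_profile(person, person))   # sample against itself
print(ibs_profile(person, sibling))  # sample against a sibling
```

A sample compared against itself shares both alleles at every marker. A sibling comparison gives a mixture of 2, 1, and 0 shared alleles, which is the numerical counterpart of the patchwork on the left-hand ideogram.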

What Was Confirmed with the Osama DNA Test?

When they tested Osama’s sister’s DNA, they proved that the man they killed is definitely Osama’s Sister’s sibling (follow me?). Technically, they did not test for identity as they did not match Osama’s DNA against a previous sample. Theoretically, another sibling of theirs could have been the person who was killed. I’m pretty sure it was him though. I don’t know if they have a previous Osama sample, but that test would have been more definitive if they used that. This doesn’t change anything for me…I’m glad he’s gone.

Ion Torrent: The Dark Side of DNA Sequencing

Ion Torrent. Even the name is cool. It suggests a deluge of charged particles that are here to drastically change the way we do science. Whether or not the technology ends up being as disruptive as surrounding press would suggest, there is no question that their core sequencing technology represents a huge deviation from the techniques used for sequencing in the past decades.

Sanger Sequencing: The Chain Termination Method

*Most of the following information was adapted from the Wikipedia entry on DNA Sequencing.

Developed by Frederick Sanger and colleagues in 1977, the chain termination method uses a DNA synthesis reaction mixture of normal deoxynucleotide triphosphates (dNTPs) along with dideoxynucleotide triphosphates (ddNTPs). These ddNTPs are incorporated into a growing chain of nucleotides much like the normal dNTPs. However, the ddNTPs lack the 3′-hydroxyl (-OH) group necessary to form a phosphodiester bond with the next nucleotide and continue the extension of the DNA chain. Once a ddNTP is incorporated into a DNA strand, there is no further extension of that strand by DNA polymerase.

By labeling the four different types of ddNTPs (A, T, C and G) with different fluorescent signals, once a ddNTP is incorporated into the elongating DNA strand, the light signals emitted by the molecule allow for the interpretation of the base at that position. Originally, oligonucleotides would then be separated by gel electrophoresis to understand at exactly which position the base was incorporated, but eventually, more automated methods came about to help improve the process. The following video does a good job of explaining exactly how chain termination methods work.

Knowledge in Light: The Dominance of Fluorescence

Fluorescence signaling has been the primary and dominant method for sequencing for the past few decades. As seen in the picture to the left, nucleotide addition results in the release of a diphosphate (pyrophosphate) group, which light-based methods such as pyrosequencing convert into a detectable light signal. The Ion Torrent team cleverly realized that diphosphate is only one of the reaction products released upon incorporation of a nucleotide. Formation of the phosphodiester bond between the new nucleotide and the previously incorporated nucleotide also results in the release of an H+ from the 3′ hydroxyl group of the chain backbone.

Instead of converting the released diphosphate into light, the Ion Torrent team decided to try something radically different: an attempt to determine nucleotide addition by monitoring the release of hydrogen ions. This method requires no light, a huge deviation from the past decades of sequencing.

Sequencing in the Dark: A Semiconductor Platform

The underlying principle, determining DNA sequence through electrical detection, was described in 2006 by Nader Pourmand, Ronald Davis and their team at Stanford. As pictured here, the release of hydrogen ions produces an electrical signal that can be measured on Ion Torrent’s CMOS-based platform. Rothberg and his team have built out a robust and disruptive platform that takes advantage of this principle.

From a practical standpoint, it seems as though the semiconductor chips have ample space to affix many different primers targeting specific regions of the genome. Once genomic DNA is added, sequencing essentially takes place by performing a series of washes. Each wash contains one of the four possible nucleotides (A, T, C, or G). If a nucleotide is incorporated in a well, the release of hydrogen ions will be detected electrically in that well. No light required.
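The wash-by-wash idea is easy to sketch in code. What follows is an idealized, noise-free toy model of a single well, not Ion Torrent's actual signal processing: each wash offers one nucleotide, and the "signal" is simply the number of bases incorporated during that wash (a run of identical bases is incorporated in a single wash). The function names and the simple repeating `TACG` wash order are assumptions made for illustration.

```python
def simulate_flows(read, flow_order="TACG"):
    """Return (nucleotide, bases_incorporated) for each wash over one well.

    read: the strand being synthesized, written 5' to 3'.
    flow_order: the repeating order in which nucleotides are washed over.
    """
    unknown = set(read) - set(flow_order)
    if unknown:
        raise ValueError(f"bases never offered by any wash: {sorted(unknown)}")
    signals = []
    pos = 0
    flow = 0
    while pos < len(read):
        base = flow_order[flow % len(flow_order)]
        run = 0
        # Every consecutive matching base is incorporated in this one wash.
        while pos < len(read) and read[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))  # run == 0 means no signal in this wash
        flow += 1
    return signals

def call_bases(signals):
    """Rebuild the sequence from the per-wash signals."""
    return "".join(base * count for base, count in signals)

signals = simulate_flows("TTAGC")
print(signals)
print(call_bases(signals))
```

In this toy model a wash with no matching base produces a zero, and a run of identical bases produces a proportionally larger signal in a single wash. Telling a run of, say, 7 from a run of 8 therefore depends on how precisely the signal size can be measured.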

I’m excited to see how this technology scales. For the time being, the limitations seem to be extremely similar to Illumina’s MiSeq. However, they are offering a much cheaper price point per sample, and they claim that scalability is going to be on par with what we’ve seen for the past decade of semiconductors. More on the difference between the two here.

While much of the talk is centered around the total number of base pairs that can be read in parallel, one of my main concerns is actually about the length of individual reads. In the quest to shrink and optimize whole genome sequencing, it is important that full genomic coverage is achieved. Shorter read lengths will be incapable of testing for certain parts of the genome where large repetitive elements or pseudo-genes prevent one from easily discerning the true sequence. Right now both companies claim that the max read lengths will be around 400bp for their latest platforms. For the time being, this is great. However, I predict that the future will look better for the company that is able to determine more bases per read.

Sequence Variants and the Genomic Databases: Standardizing the Nomenclature

For those interested in clinical diagnostics of genetic diseases, the ability to use the molecular information presented within our various genomic databases is somewhat limited. If you attempt to locate the common p.Q12X mutation within the NCBI Reference Sequence of the AMPD1 gene (which causes Adenosine Monophosphate Deaminase Deficiency), you will find that the 12th codon does not correspond to Glutamine. According to the current RefSeq gene (NM_000036.2), the mutation is actually p.Q45X. For researchers and clinicians looking for SNP primers or probes for specific mutations, the lack of fidelity and congruency between reported mutations and the “curated” databases is more than a headache. I have personally spent hours chasing down one mutation only to remain uncertain as to whether or not I had located the proper nucleotide in the end.

What is the issue? Did the original researchers not properly locate the mutation? Is the Reference Sequence incorrect? What can be done to increase the ease with which we can not only locate a specific phenotype-associated variant, but determine the nucleotides immediately surrounding that locus as well?

Understanding the Genomic Databases

The National Center for Biotechnology Information (NCBI) provides an excellent summary of the differences between its two major databases: RefSeq and GenBank. The major distinction for those interested in clinical genetics is the fact that RefSeq is curated and GenBank is not. GenBank is essentially a public repository of all DNA sequences made available by researchers who are responsible for updating and maintaining their own submitted sequences. RefSeq is the database that attempts to provide one sequence as the “Reference Sequence,” usually determined by looking at the most common variants among the GenBank entries. RefSeq sequences are occasionally updated, which is why it is always important to include the accession number of the gene when using the Reference Sequence.

RefSeq has greater utility because it provides linked records between the genomic DNA, the mRNA transcript (and cDNA record), and the translated protein. It would seem that locating a non-synonymous mutation should take just a few clicks.

An Arcane Nomenclature System

Okay, I admit, arcane is a bit harsh, but it is time to update our nomenclature system. The Human Genome Variation Society (HGVS) provides the guidelines that are currently considered the best practice in naming sequence variants. They rely on coding DNA (cDNA) and essentially number the cDNA beginning with 1 for the first Adenine in the initiation (ATG) codon. There are obvious drawbacks to using cDNA, the most important being lack of numerical assignments for intronic nucleotides.

More and more intronic mutations are being found to account for deleterious phenotypes. Yet, the original (deprecated) system for naming intron variants looks something like this: IVS3+22G>A (interpretation: the G 22 basepairs from the beginning of intron 3 is changed to an A). There always seems to be some ambiguity when I come across mutations described this way: we can also write IVS3-98G>A (the G 98 basepairs from the end of intron 3 is changed to an A).

More recently, it has been recommended to describe intronic mutations according to the closest possible cDNA location. Our mutation above might now be described as c.195+22G>A (cDNA nucleotide 195 is the last nucleotide found on exon 3, which immediately precedes intron 3). Or, we might write it as c.196-98G>A.
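To show how little information these names actually carry, here is a small sketch that pulls one apart. It only handles simple substitutions written in the style of the two examples above; the pattern, the `IntronicVariant` type, and the function name are my own illustration, not an official HGVS parser.

```python
import re
from typing import NamedTuple

class IntronicVariant(NamedTuple):
    anchor: int   # nearest exonic cDNA position
    offset: int   # + counts into the intron from its start, - from its end
    ref: str      # reference base
    alt: str      # variant base

# Matches names such as c.195+22G>A or c.196-98G>A (substitutions only).
_PATTERN = re.compile(r"^c\.(\d+)([+-]\d+)([ACGT])>([ACGT])$")

def parse_intronic(name):
    """Parse a simple intronic substitution written relative to the cDNA."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a simple intronic substitution: {name!r}")
    anchor, offset, ref, alt = match.groups()
    return IntronicVariant(int(anchor), int(offset), ref, alt)

print(parse_intronic("c.195+22G>A"))
print(parse_intronic("c.196-98G>A"))
```

Both names parse cleanly, but neither one says where the base sits in the genomic sequence. The parsed result is only an exon boundary plus an offset.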

Let’s be practical for a second. This is terrible indexing. If I want to find an intronic mutation, it will be indexed according to the cDNA record of the relevant gene. Fine, but if I search the cDNA for nucleotide 195, the intron is spliced out of the transcript, so I cannot locate the sequence of interest there. Instead, I have to take the last few base pairs from exon 3 of the cDNA transcript, open up the gDNA (genomic DNA) transcript, and search for these base pairs. Then, once I have discovered where this exon ends in the genomic DNA, I must count 22 base pairs forward to locate the nucleotide of interest (this will all be in vain if the mutation was reported incorrectly in the literature).

The lesson here: although the RefSeq genomic DNA is linked to the cDNA and the protein records, the actual nucleotides/codons/amino acids are not indexed together.

A Standardized Index: Making Sure Mutation Nomenclature is Static

The cDNA naming convention grew out of the fact that the ability to completely sequence genomic DNA for the major part of the 20th century was very limited. It made sense for mutations to be reported based on the cDNA because it reflected the two more easily obtainable (and more static) records of that time: the mRNA transcript and the protein sequence. However, now that the Human Genome Project has paved the way for easier sequencing, and we are constantly improving a standard Reference Sequence, it makes much more sense to name mutations in terms of their genomic DNA (gDNA) location.

As reported by Dalgleish et al., the Locus Reference Genomic (LRG) sequence format is being developed specifically for the purpose of accurately indexing genomic variants. The LRG will provide a static record for every gene of interest, and this record WILL NOT CHANGE. It may be annotated, etc., but in the interest of preserving reported mutations, the backbone sequence (and subsequent base pair numbers) will not change.

I have been using the Reference Sequence in a similar function whenever I locate a mutation for testing purposes. For example, I was searching for the p.G380R mutation in the FGFR3 gene which is a common cause of Achondroplasia. The cDNA variant I was interested in was c.G1138A. After searching for the mutation within the gene transcript, I located the surrounding base pairs, and determined that the genomic name for the mutation is g.G10458A. In my system, I start with the Adenine of the initiation codon as 1 for the gDNA as well as the cDNA. Thus, the gDNA and cDNA index numbers diverge once the first intron begins. For point mutations, I have written a script that automatically accomplishes this by interacting with the UCSC genome browser. An indexing revelation is what helps this script to work.

Relational Indexing: Relating the cDNA to the gDNA

I have been developing my own library of RefSeq genome transcripts, but I have been indexing them more efficiently. My gDNA files, as previously mentioned, are labeled 1 at the beginning of the initiation codon. However, the non-coding transcribed region preceding the initiation codon is labeled from -1 downward, beginning at the nucleotide immediately preceding ATG (there is no 0). From the start codon, the gDNA positions are numbered sequentially. Similarly, the cDNA transcript begins with the start codon and is numbered from 1 until the end of the stop codon.
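The "no 0" rule is the part that tends to trip up code, so here is a minimal sketch of the labeling. The sequence is a made-up nine-base example, and the two helper names are hypothetical; this is an illustration of the numbering, not my actual library.

```python
def label_positions(length, atg_offset):
    """Label each base of a sequence relative to the A of the ATG.

    length: number of bases in the sequence.
    atg_offset: 0-based offset of the A of the initiation codon.
    The A is labeled 1 and the base just before it is labeled -1.
    """
    labels = []
    for i in range(length):
        rel = i - atg_offset
        # Positions at or after the A are shifted up by one to skip 0.
        labels.append(rel + 1 if rel >= 0 else rel)
    return labels

def label_to_offset(label, atg_offset):
    """Convert a label back into a 0-based offset into the sequence."""
    if label == 0:
        raise ValueError("there is no position 0 in this numbering")
    return atg_offset + label - 1 if label > 0 else atg_offset + label

sequence = "GGCATGAAA"            # toy sequence, three bases before the ATG
atg_offset = sequence.find("ATG")
print(label_positions(len(sequence), atg_offset))
print(sequence[label_to_offset(1, atg_offset)])   # the A of the ATG
print(sequence[label_to_offset(-1, atg_offset)])  # the base just before it
```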

To relate the cDNA to the gDNA, I use exon indexing. For cDNA, every nucleotide is assigned to an exon, and these nucleotides are numbered from 1 until the end of the exon. Thus, if exon 3 is 79 base pairs long, the nucleotides within exon 3 will also be numbered from 1 to 79. For gDNA, every nucleotide is assigned to an exon or intron, and these nucleotides are numbered from 1 until the end of the exon/intron. Thus, we can relate the cDNA position to the gDNA position through the exon indexing.

For example, if I want to determine the gDNA position of cDNA locus 285, the analysis is very simple:

  1. Find nucleotide 285 within the cDNA file.
  2. Determine the exon of nucleotide 285 (exon 2).
  3. Determine the index within exon 2 of the nucleotide (exon 2, position 83).
  4. Locate exon 2, position 83 within the gDNA file.
  5. Determine the gDNA position associated with exon 2, position 83 (382).

In this way, we have effortlessly determined that c.285 corresponds to g.382.
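The five steps above can be sketched in a few lines. The exon coordinates here are hypothetical, chosen so that they reproduce the numbers in the example (exon 1 spans g.1 to g.202, exon 2 starts at g.300). The sketch also assumes the exons listed are the coding portions only, with gDNA numbered from the A of the ATG.

```python
def cdna_to_exon_index(c_pos, exons):
    """Steps 1-3: find the exon and the position within it for a cDNA locus.

    exons: list of (g_start, g_end) pairs in gDNA numbering, in order.
    Returns (exon_number, position_within_exon), both 1-based.
    """
    if c_pos < 1:
        raise ValueError("cDNA positions start at 1")
    consumed = 0  # cDNA bases accounted for by earlier exons
    for number, (g_start, g_end) in enumerate(exons, start=1):
        length = g_end - g_start + 1
        if c_pos <= consumed + length:
            return number, c_pos - consumed
        consumed += length
    raise ValueError("cDNA position lies beyond the last exon")

def exon_index_to_gdna(exon_number, within, exons):
    """Steps 4-5: turn (exon, position within exon) into a gDNA position."""
    g_start, g_end = exons[exon_number - 1]
    g_pos = g_start + within - 1
    if g_pos > g_end:
        raise ValueError("position runs past the end of the exon")
    return g_pos

def cdna_to_gdna(c_pos, exons):
    exon_number, within = cdna_to_exon_index(c_pos, exons)
    return exon_index_to_gdna(exon_number, within, exons)

# Hypothetical exon layout matching the worked example in the text.
EXON_LAYOUT = [(1, 202), (300, 450), (600, 700)]
print(cdna_to_exon_index(285, EXON_LAYOUT))
print(cdna_to_gdna(285, EXON_LAYOUT))
```

Because both files share the (exon, position within exon) index, the lookup never has to search for matching base pairs by eye.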

Developing a Functional, Interactive Reference Assembly

Although the data files are linked between RefSeq gDNA, cDNA and amino acid sequences, the files are not indexed to the extent that it is entirely useful. Ultimately, I would like to be able to browse a cDNA/gDNA sequence, perform a base-pair change, determine if this change is synonymous or non-synonymous, and find out if the change has been associated with any disease phenotypes (which would entail linking individual nucleotides to OMIM records). The development of such a functional variation browser will require a lot of forethought, smart programming, and a great deal of curation. However, I believe that development of the Locus Reference Genomic (LRG) sequence is a step in the right direction.
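As a taste of the "perform a base-pair change" piece, here is a minimal sketch that applies a single substitution to a coding sequence and reports whether it changes the amino acid. The twelve-base sequence is a toy, not a real gene, and `classify_substitution` is a hypothetical name. The codon table is the standard genetic code, and the OMIM lookup is left out entirely.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

def classify_substitution(cds, c_pos, alt):
    """Classify a single-base change in a coding sequence.

    cds: coding sequence starting at the ATG.
    c_pos: 1-based cDNA position of the changed base.
    alt: the new base.
    Returns (kind, protein_level_name), with X marking a stop codon.
    """
    codon_start = (c_pos - 1) // 3 * 3
    codon = cds[codon_start:codon_start + 3]
    if c_pos < 1 or len(codon) < 3:
        raise ValueError("position does not fall within a complete codon")
    offset = (c_pos - 1) % 3
    mutated = codon[:offset] + alt + codon[offset + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    codon_number = codon_start // 3 + 1
    kind = "synonymous" if before == after else "non-synonymous"
    after_label = "X" if after == "*" else after
    return kind, f"p.{before}{codon_number}{after_label}"

TOY_CDS = "ATGCAAGGGTAA"  # Met, Gln, Gly, stop
print(classify_substitution(TOY_CDS, 4, "T"))  # CAA -> TAA
print(classify_substitution(TOY_CDS, 7, "A"))  # GGG -> AGG
print(classify_substitution(TOY_CDS, 9, "A"))  # GGG -> GGA
```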


Dalgleish R, Flicek P, Cunningham F, Astashyn A, Tully RE, Proctor G, Chen Y, McLaren WM, Larsson P, Vaughan BW, Béroud C, Dobson G, Lehväslaiho H, Taschner PE, den Dunnen JT, Devereau A, Birney E, Brookes AJ, & Maglott DR (2010). Locus Reference Genomic sequences: an improved basis for describing human DNA variants. Genome medicine, 2 (4) PMID: 20398331
