I’m starting a new company. For now it is called The Geno.me. Our purpose: to develop a highly curated, open-source, easy-to-use Genome Browser that can better help make sense of the mountains of genomic data. We want to make it like GitHub (for those of you who are computer scientists), but for Genomics. Only this way can we truly create an open-source mechanism for attacking the genome.
In the process of coding out our prototype, we’ve discovered so many more issues with the way data is currently stored. You can see a post from a while back where I discuss the incorrect nomenclature of the many monogenic mutations in OMIM. In this current post, I’ll be talking about the issue of gene structures and the way they have been recorded in ENSEMBL.
What are Gene Structures?
On the most basic level, genes are stretches of linked nucleotides that RNA polymerase transcribes from DNA into RNA. Genes are identified in part by the Open Reading Frames (ORFs) they contain. Within a gene, certain sections end up in the mature RNA while certain sections do not. Classically, the sections that are kept are referred to as exons, and the sections that are spliced out after transcription are referred to as introns.
Once the introns are spliced out, the remaining RNA molecule is translated into a chain of amino acids. This amino acid chain folds into the proteins that we all know and love. Enzymes, substrates, channels, cell structures, and many more proteins are created in this fashion. The creation process can be summarized as:
DNA (exons, introns) -> Transcription -> pre-mRNA (exons, introns) -> Splicing -> mRNA (exons only) -> Translation -> Protein
Congrats. You can now pass Biology 101.
Reality check: For most genes, the transcribed region contains one more structure besides the exons and the introns – the Untranslated Regions (UTRs). These are regions that are transcribed, but are not translated. See what I did there…I italicized two similar, but importantly distinct words so that we don’t miss this point.
How does ENSEMBL store this data?
Let’s take a gene as a case study: SOD1 (Superoxide Dismutase 1), located on chromosome 21. Here’s an overview of the gene (courtesy of The Geno.me):
In this image, the blue regions represent exons, the gray regions represent introns, and the red regions represent UTRs. In the ENSEMBL structure database, 5 regions are listed. They have grouped the first region (red + blue) into a single entry, and likewise the last region (red + blue). They refer to all of them as exons. None of the introns are listed.
While it is very possible to figure out introns based on the location of the exons, this is a mess when dealing with the data. Instead of searching directly for an intron, we must first identify the surrounding exons, etc. The other main problem is that UTRs are not separated from exons! Elsewhere, ENSEMBL lists the positions of the UTRs, but what this means is that in the case of the start and ending exons for this gene, we have to reference many different pieces of data to determine the ultimate structure of the gene! It’s a HUGE headache.
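To make the extra step concrete, here is a minimal sketch of what "figuring out the introns from the exons" looks like in practice. The function name `introns_from_exons` and the coordinates are hypothetical (toy numbers, not real SOD1 positions), and I'm assuming 1-based, inclusive coordinates on a single chromosome.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based and inclusive (an assumption)


def introns_from_exons(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive exons, i.e. the introns."""
    ordered = sorted(exons)
    introns = []
    # Walk over each pair of neighbouring exons and record the gap between them.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:  # skip exons that touch or overlap
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Toy coordinates for a five-exon gene (made up for illustration only).
toy_exons = [(100, 200), (300, 350), (500, 560), (700, 780), (900, 1000)]
print(introns_from_exons(toy_exons))
```

It's only a few lines, but every query that cares about introns has to run something like this first, because the introns themselves are never stored.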
How does The Geno.me fix this problem?
At The Geno.me, we’ve taken care to index every type of gene structure according to its absolute position on the chromosome. In the above picture of SOD1, we count 11 structures (not 5 like ENSEMBL): 2 UTRs, 5 Exons, and 4 Introns. Each one gets its own entry, so nothing has to be inferred from its neighbors. Introns are structures too.
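Here is a rough sketch of the idea, not our actual implementation. It assumes a forward-strand gene, 1-based inclusive coordinates, and that the coding start and end positions are known. The names `build_structures`, `cds_start`, and `cds_end` are hypothetical, and the coordinates are toy numbers rather than real SOD1 positions.

```python
from collections import Counter
from typing import List, Tuple

Structure = Tuple[str, int, int]  # (kind, start, end), 1-based and inclusive


def build_structures(exons: List[Tuple[int, int]],
                     cds_start: int, cds_end: int) -> List[Structure]:
    """Split exon-style entries into explicit UTR, exon, and intron records."""
    ordered = sorted(exons)
    structures: List[Structure] = []
    for i, (start, end) in enumerate(ordered):
        # The gap since the previous entry is an intron.
        if i > 0:
            prev_end = ordered[i - 1][1]
            if start > prev_end + 1:
                structures.append(("intron", prev_end + 1, start - 1))
        # Anything before the coding start is untranslated.
        if start < cds_start:
            structures.append(("UTR", start, min(end, cds_start - 1)))
        # The part that overlaps the coding region is kept as an exon.
        coding_start, coding_end = max(start, cds_start), min(end, cds_end)
        if coding_start <= coding_end:
            structures.append(("exon", coding_start, coding_end))
        # Anything after the coding end is untranslated.
        if end > cds_end:
            structures.append(("UTR", max(start, cds_end + 1), end))
    return structures


# Toy five-entry gene where the first and last entries each include a UTR.
toy_exons = [(100, 200), (300, 350), (500, 560), (700, 780), (900, 1000)]
structures = build_structures(toy_exons, cds_start=150, cds_end=950)
counts = Counter(kind for kind, _, _ in structures)
print(len(structures), dict(counts))
```

With these toy inputs, the five grouped entries become eleven separate records: 2 UTRs, 5 exons, and 4 introns, already in chromosome order, so a search for any one type of structure is a direct lookup.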
It’s been exciting building a new genome browser. Anything that annoys us about the old browsers (or quality of data) is within our capacity to correct in a scalable way. If you have any suggestions of features or fixes you’d like added to The Geno.me, please feel free to contact me! The Geno.me is currently in private beta, but if you think your group/lab would be good test dummies for us, go ahead and request an invite.