The not so standard GFF3 standards

This article describes the difficulties in relating exon features to their parent genes across GFF3 annotations provided by different major annotation databases.

Annotation formats provided by each major annotation database

A GFF3 file has 9 basic columns separated by tabs: seqid, source, type, start, end, score, strand, phase and attributes.

One entry in a GFF3 file looks like this:

The first 8 columns are well behaved: they are consistent across all GFF3 files. The 9th column, however, is essentially an arbitrary collection of extra columns containing additional attributes. The GFF3 specification prescribes some standard fields to use, such as:

The attribute that matters most here is the one identifying the gene a feature belongs to, because it allows you to relate an exon to its parent gene, which is commonly of interest. The issue is that this information is not encoded consistently across providers. In the RefSeq annotation, for example, it is encoded in a field called Dbxref:

Worse than that, it could be encoded as a sub-field of Dbxref, as in the case below:

This makes relating an exon to its parent gene a slightly more complicated process, especially in languages like C and C++, which are generally used for high performance processing of genomic data but lack elegant standard facilities for string manipulation. The most complicated of the three (RefSeq, Gencode and ENSEMBL) is the ENSEMBL GFF3:

Here there’s no way to directly discover the parent gene of an exon, because the exon is a 3rd level feature, with genes being 1st level features and transcripts being 2nd level features. This means you have to keep track of the transcripts you’ve already seen and find an exon’s parent transcript before tracking down the parent gene. This greatly complicates the process of finding the parent gene of an exon.

One solution is to use the general parent tracing method for all annotation types, since ID and Parent are standard attribute fields, but this is complicated by the slightly different formats each database chose for storing this information. In ENSEMBL you have:

(I have added spaces between entries for readability)

So you can keep a table of transcripts and their parent genes, then match the parent transcript of an exon to a gene. In the third entry, which annoyingly does not have an ID, you would find its parent transcript:ENSMUST00000193812 and then the transcript’s parent gene:ENSMUSG00000102693. You’d probably also do some extra work to snip the feature type prefix (“transcript:”, “gene:”) off.
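The transcript table described above can be sketched in Python as follows. This is a minimal illustration rather than a full parser: the helper names (parse_attributes, strip_prefix, exon_to_gene) are made up for this example, the coordinates in the sample lines are placeholders, and it assumes that each transcript line appears before its exons and that every feature has a single parent.

```python
def parse_attributes(column9):
    """Split the 9th GFF3 column into a dict of key -> value."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def strip_prefix(identifier):
    """Drop an Ensembl-style 'gene:' or 'transcript:' type prefix."""
    return identifier.split(":", 1)[-1]


def exon_to_gene(lines):
    """Yield (start, end, gene_id) for each exon in Ensembl-style GFF3 lines."""
    transcript_to_gene = {}  # transcripts seen so far -> their parent gene
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        feature_id = attrs.get("ID", "")
        parent = attrs.get("Parent", "")
        if feature_id.startswith("transcript:") and parent:
            transcript_to_gene[strip_prefix(feature_id)] = strip_prefix(parent)
        elif cols[2] == "exon" and parent:
            # gene is None if the parent transcript has not been seen yet
            gene = transcript_to_gene.get(strip_prefix(parent))
            yield int(cols[3]), int(cols[4]), gene


# Illustrative lines only: IDs from the text, placeholder coordinates.
example = [
    "1\thavana\tgene\t100\t900\t.\t+\t.\tID=gene:ENSMUSG00000102693",
    "1\thavana\tlnc_RNA\t100\t900\t.\t+\t.\t"
    "ID=transcript:ENSMUST00000193812;Parent=gene:ENSMUSG00000102693",
    "1\thavana\texon\t100\t900\t.\t+\t.\tParent=transcript:ENSMUST00000193812",
]
print(list(exon_to_gene(example)))
```

Keying the table on the ID prefix rather than on the feature type means a transcript typed as something other than “transcript” is still recorded.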

Then for Gencode:

You can trace the exon to its parent transcript ENSMUST00000193812.1, then to the parent gene ENSMUSG00000102693.1.

Now for RefSeq:

If you simply use a Parent-to-ID algorithm you’d end up with an uninformative gene0 where you really wanted 37102.
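For RefSeq, one way to get at the informative identifier is to read the GeneID sub-field of Dbxref directly. The sketch below assumes Dbxref holds comma-separated “database:accession” pairs; the function name refseq_gene_id and the sample attribute strings (including the Genbank accession) are invented for illustration.

```python
def refseq_gene_id(column9):
    """Return the GeneID sub-field of Dbxref from a GFF3 attribute column, or None."""
    for field in column9.strip().split(";"):
        key, _, value = field.partition("=")
        if key.strip() != "Dbxref":
            continue
        # Dbxref is assumed to be a comma-separated list of db:accession pairs.
        for xref in value.split(","):
            database, _, accession = xref.partition(":")
            if database.strip() == "GeneID":
                return accession.strip()
    return None


# Made-up attribute columns; only the GeneID value comes from the text.
print(refseq_gene_id("ID=id1;Parent=rna0;Dbxref=GeneID:37102,Genbank:NM_000000.0"))
print(refseq_gene_id("ID=id1;Parent=rna0"))
```

The function returns None rather than falling back to the Parent value, so a missing GeneID is visible to the caller instead of silently becoming gene0.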

There are other subtleties, such as the fact that there are multiple types of genes and transcripts, so simply looking for “gene” and “transcript” to identify 1st and 2nd level features is insufficient. There are features annotated “pseudogene” or “lncRNA”, and many others, that should be treated equivalently to genes and transcripts.

The simplest solution, then, is to first detect what type of annotation you’re working with. If you are working with RefSeq or Gencode, you can obtain the name of the gene on the same entry where the exon is found. If you are dealing with ENSEMBL, you need to perform some lineage tracing to discover the parent gene’s identity.

Distinguishing between the providers of the annotation is unfortunately not particularly straightforward. GENCODE, for example, has a helpful header line stating they are the provider:

RefSeq does not provide such a line, but perhaps you could infer it from the “NCBI annotwriter” line:

ENSEMBL headers contain almost no uniquely identifying information; one may use the presence of “sequence-region” or “genome-build Ensembl”:

Alternatively in the first entries of ENSEMBL and RefSeq annotations you find:

and

but this convention is not found in GENCODE where the first entry is:

So robustly detecting the annotation source, in a way that is hopefully species agnostic, requires a few ad-hoc tricks which may be quite brittle. Since the parsing strategy depends on the annotation source, there is no room for error in the guessing algorithm.
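A header-sniffing heuristic along these lines might look like the sketch below. The marker strings are assumptions: “NCBI annotwriter” and “genome-build” are taken from the discussion above, while the exact text “provider: GENCODE” is a guess at the GENCODE header wording. The function name guess_provider and the sample header lines are hypothetical. The checks are ordered because a “genome-build” line alone is a weak signal, so it is only used as a last resort.

```python
def guess_provider(header_lines):
    """Guess the annotation provider from the leading '#' lines of a GFF3 file.

    Returns 'gencode', 'refseq', 'ensembl', or None when nothing matches.
    """
    header = "\n".join(line for line in header_lines if line.startswith("#"))
    if "provider: GENCODE" in header:  # assumed wording of the GENCODE line
        return "gencode"
    if "NCBI annotwriter" in header:
        return "refseq"
    if "genome-build" in header:  # weak signal, checked last
        return "ensembl"
    return None


# Illustrative header lines, not copied from real files.
print(guess_provider(["##gff-version 3", "#provider: GENCODE"]))
print(guess_provider(["##gff-version 3", "#!processor NCBI annotwriter"]))
print(guess_provider(["##gff-version 3"]))
```

Returning None for an unrecognised header lets the caller stop instead of parsing with the wrong strategy.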

A much more complicated, more memory intensive, but general solution would be to create a tree out of the GFF3 that reflects the hierarchical structure described while preserving full information at each node, then to obtain the gene-exon relationships through traversal. But who wants to do all that to parse what should have been a relatively simple flat table?
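For a sense of what that general approach involves, here is a rough Python sketch. It stores parent links in an ID index rather than building explicit child lists, which is enough to walk from an exon up to its top-level feature. The names Feature, load_features and root_of are invented for this example, the sample lines are placeholders in the RefSeq style, and only the first parent is followed when a feature lists several.

```python
from collections import namedtuple

Feature = namedtuple("Feature", "type start end attrs")


def load_features(lines):
    """Read all features into memory and index the ones that carry an ID."""
    features, by_id = [], {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        feature = Feature(cols[2], int(cols[3]), int(cols[4]), attrs)
        features.append(feature)
        if "ID" in attrs:
            by_id[attrs["ID"]] = feature
    return features, by_id


def root_of(feature, by_id):
    """Follow Parent links upward until a feature with no known parent is reached."""
    seen = set()  # guards against accidental Parent cycles
    while True:
        parent_id = feature.attrs.get("Parent", "").split(",")[0]
        if not parent_id or parent_id not in by_id or parent_id in seen:
            return feature
        seen.add(parent_id)
        feature = by_id[parent_id]


# Placeholder lines; the exon is listed first to show order does not matter.
example = [
    "chr1\tRefSeq\texon\t100\t200\t.\t+\t.\tID=id1;Parent=rna0",
    "chr1\tRefSeq\tmRNA\t100\t900\t.\t+\t.\tID=rna0;Parent=gene0",
    "chr1\tRefSeq\tgene\t100\t900\t.\t+\t.\tID=gene0",
]
features, by_id = load_features(example)
for f in features:
    if f.type == "exon":
        print(f.start, f.end, root_of(f, by_id).attrs.get("ID"))
```

Because the walk never looks at feature type names, “pseudogene” or “lncRNA” features need no special handling, at the cost of holding every feature in memory.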
