On genome coordinate systems and transposable element annotation

[Update: For an extended discussion of the issues in this post, see: Bergman (2012) "A proposal for the reference-based annotation of de novo transposable element insertions" Mobile Genetic Elements 2:51-54]

Before embarking on any genome annotation effort, it is necessary to establish the best way to represent the biological feature under investigation. This post discusses how best to represent the annotation of transposable element (TE) insertions that are mapped to (but not present in) a reference genome sequence (e.g. from mutagenesis or re-sequencing studies), and how the standard coordinate system in use today causes problems for the annotation of TE insertions.

There are two major coordinate systems in genome bioinformatics, which differ primarily in whether they anchor genomic features to (“base coordinate systems”) or between (“interbase coordinate systems”) nucleotide positions. Most genome annotation portals (e.g. NCBI or Ensembl), bioinformatics software (e.g. BLAST) and annotation file formats (e.g. GFF) use the base coordinate system, which represents a feature starting at the first nucleotide as position 1. In contrast, a growing number of systems (e.g. UCSC, Chado, DAS2) employ the interbase coordinate system, whereby a feature starting at the first nucleotide is represented as position 0. Note that the UCSC genome bioinformatics team actually uses both systems and refers to the base coordinate system as “one-based, fully-closed” (used in the UCSC genome browser display) and the interbase coordinate system as “zero-based, half-open” (used in their tools and file formats), leading to a FAQ about this issue by users. The interbase coordinate system is also referred to as “space-based” by some authors.
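As a minimal sketch of how the two systems relate, the following Python functions convert an interval between the “one-based, fully-closed” and “zero-based, half-open” conventions. The function names are made up for illustration and do not come from any of the tools mentioned above.

```python
def base_to_interbase(start, end):
    """Convert a 1-based, fully-closed interval to a 0-based, half-open one."""
    return start - 1, end


def interbase_to_base(start, end):
    """Convert a 0-based, half-open interval to a 1-based, fully-closed one."""
    if start == end:
        # A zero-length interbase feature sits between two nucleotides,
        # so no base interval describes it exactly.
        raise ValueError("zero-length feature has no base-coordinate equivalent")
    return start + 1, end


# The first ten nucleotides of a sequence in each system.
print(base_to_interbase(1, 10))   # (0, 10)
print(interbase_to_base(0, 10))   # (1, 10)
```

Only the start coordinate changes in either direction. The zero-length case, which the sketch refuses to convert, is exactly the case that matters for insertions.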

The differences between the base (bottom) and interbase (top) coordinate systems can be visualized in the following diagram (taken from the Chado wiki).

There are several advantages to using the interbase coordinate system, including: (i) the ability to represent features that occur between nucleotides (like a splice site), (ii) simpler arithmetic for computing the length of features (length=end-start) and overlaps (max(start1,start2), min(end1,end2)), and (iii) more rational conversion of coordinates from the positive to the negative strand (see discussion here).
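The arithmetic behind points (ii) and (iii) can be sketched in Python as follows. This is an illustrative comparison under the usual definitions of the two systems; the function names are hypothetical, and `seq_len` is assumed to be the length of the reference sequence.

```python
def interbase_length(start, end):
    # Interbase: length is a plain difference.
    return end - start


def base_length(start, end):
    # Base: both endpoints are included, hence the +1.
    return end - start + 1


def interbase_overlap(a, b):
    """Overlap of two (start, end) interbase intervals, or None if they share no bases."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None


def interbase_to_minus_strand(start, end, seq_len):
    # Interbase: swap the endpoints and subtract each from the sequence length.
    return seq_len - end, seq_len - start


def base_to_minus_strand(start, end, seq_len):
    # Base: the same swap, but each coordinate needs a +1 correction.
    return seq_len - end + 1, seq_len - start + 1


# The first ten nucleotides of a 100 bp sequence in each system.
print(interbase_length(0, 10), base_length(1, 10))
print(interbase_overlap((0, 10), (5, 15)))
print(interbase_to_minus_strand(0, 10, 100), base_to_minus_strand(1, 10, 100))
```

In the interbase versions no correction terms appear anywhere, whereas each base-coordinate function carries its own +1.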

So why is the choice of coordinate system important for the annotation of TEs mapped to a reference sequence? The short answer is that TEs (and other insertions) that are not a part of the reference sequence occur between nucleotides in the reference coordinate system, and therefore it is difficult to accurately represent the location of a TE on base coordinates. Nevertheless, base coordinate systems dominate most of genome bioinformatics and are an established framework that one has to work within.

How then should we annotate TE insertions on base coordinates that are mapped precisely to a reference genome? If a TE insertion in reality occurs between positions X and X+1 in a genome, do we annotate the start and end position both at the same nucleotide? If so, do we annotate both the start and end coordinates at position X, or both at position X+1? If we choose to annotate the insertion at position X, then we need to invoke a rule that the TE inserts after nucleotide X. However, this solution breaks down if the insertion is on the negative strand, since we either need to map a negative strand insertion to X+1 or have a different rule for interpreting the placement of the TE on positive and negative strands. Alternatively, do we annotate the TE as starting at X and ending at X+1, attempting to fake interbase coordinates on a base coordinate system, but at face value implying that the TE insertion is not mapped precisely and spans 2 bp in the genome?

After grappling with this issue for some time, it seems that neither of these solutions is sufficient to deal with the complexities of TE insertion and reference mapping. To understand why, we must consider the mechanisms of TE integration and how TE insertions are mapped to the genome. Most TEs create staggered cuts in the genomic DNA that are filled in on integration into the genome, leading to short target site duplications (TSDs). Most TEs also target a palindromic sequence, and insert randomly with respect to orientation. A new TE insertion is typically mapped to the genome by sequencing a fragment that spans the TE into unique flanking DNA, either by directed (e.g. inverse/linker PCR) or random (e.g. shotgun re-sequencing) approaches. The TE-flank fragment can be obtained from the 5′ or 3′ end of the TE. However, where one places the TE insertion depends on whether one uses the TE-flank from the 5′ or 3′ end and the orientation of the TE insertion in the genome. As shown in the following diagram, for an insertion on the positive strand (>>>), a TE-flank fragment from the 5′ end is annotated to occur at the 3′ end of the TSD (shown in bold), whereas a 3′ TE-flank fragment is placed at the 5′ end of the TSD. For an insertion on the negative strand (<<<), the opposite effect occurs. In both cases, TE-flank fragments from the 5′ and 3′ end map the TE insertion to different locations in the genome.
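The dependence on TE end and orientation can be written down as a small Python sketch. It assumes the TSD occupies reference bases `tsd_start` to `tsd_end` in base coordinates, and that a TE-flank fragment places the insertion at the TSD base adjacent to the TE junction. The function name, the argument values and the example coordinates (an 8 bp TSD at 101-108) are hypothetical.

```python
def junction_position(tsd_start, tsd_end, te_strand, te_end):
    """Reference base at which one TE-flank fragment places the insertion.

    te_strand is "+" or "-"; te_end is "5prime" or "3prime".
    """
    # Flanking DNA to the left of the TE (in reference orientation) ends with
    # a full copy of the TSD, so its junction base is the last TSD base.
    # Flanking DNA to the right begins with the other TSD copy, so its
    # junction base is the first TSD base.
    flank_is_left_of_te = (te_strand == "+") == (te_end == "5prime")
    return tsd_end if flank_is_left_of_te else tsd_start


# All four combinations for a hypothetical 8 bp TSD at bases 101-108.
for strand in ("+", "-"):
    for end in ("5prime", "3prime"):
        print(strand, end, junction_position(101, 108, strand, end))
```

Under these assumptions the same target site is reported at 108 or at 101 depending only on strand and on which end was sequenced.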

Thus, where one chooses to annotate a TE insertion relative to the TSD is dependent on the orientation of the TE insertion and which end is mapped to the genome. As a consequence, both the single-base and two-base representations proposed above are flawed, since TE insertions into the same target site are annotated at two different locations on the positive and negative strand. This issue led us (in retrospect) to misinterpret some aspects of the P-element target site preference in a recent paper, since FlyBase uses a single-base coordinate system to annotate TE insertions.

As an alternative, I propose that the most efficient and accurate way to represent TE insertions mapped to a reference genome on base coordinates is to annotate the span of the TSD and label the orientation of the TE in the strand field. This formulation allows one to bypass having to choose where to locate the TE relative to the TSD (5′ vs. 3′, as is required under the one-base/two-base annotation framework), and can represent insertions into the same target site that occur on different strands. Furthermore, this solution allows one to use both 5′ and 3′ TE-flank information. In fact, the overlap between coordinates from the 5′ and 3′ TE-flank fragments defines the TSD. Finally, this solution requires no prior information about TSD length for a given TE family, and also accommodates TE families that generate variable-length TSDs, since the TSD is annotated on a per-TE basis.
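A minimal Python sketch of this proposal, assuming the reference-aligned portions of the 5′ and 3′ TE-flank fragments are available as base-coordinate (start, end) intervals and that each extends beyond the TSD into unique flanking DNA. The function names, the `example_source` and `TE_insertion` column values and the example coordinates are placeholders, not an established convention.

```python
def tsd_annotation(flank5, flank3):
    """Return (tsd_start, tsd_end, strand) from the two flank alignments.

    flank5 and flank3 are 1-based, fully-closed (start, end) intervals on the
    reference for the DNA adjoining the 5' and 3' ends of the TE.
    """
    # The overlap of the two flank alignments is the TSD.
    tsd_start = max(flank5[0], flank3[0])
    tsd_end = min(flank5[1], flank3[1])
    if tsd_start > tsd_end:
        raise ValueError("flanks do not overlap, so no TSD can be annotated")
    # For a positive-strand insertion the 5' flank lies upstream of the 3' flank.
    if flank5[0] < flank3[0]:
        strand = "+"
    elif flank5[0] > flank3[0]:
        strand = "-"
    else:
        strand = "."  # orientation cannot be inferred from these intervals
    return tsd_start, tsd_end, strand


def to_gff_line(seqid, tsd_start, tsd_end, strand, te_family):
    # Nine tab-separated columns in GFF order; source and type are placeholders.
    fields = [seqid, "example_source", "TE_insertion", str(tsd_start),
              str(tsd_end), ".", strand, ".", f"Name={te_family}"]
    return "\t".join(fields)


start, end, strand = tsd_annotation((51, 108), (101, 160))
print(to_gff_line("2L", start, end, strand, "P-element"))
```

Swapping the two flank intervals yields the same span with strand "-", so insertions into the same target site share coordinates regardless of orientation.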

The only problem left open by this proposed solution is for TEs that do not create a TSD, which have been reported to exist. Any suggestions for a general solution that also allows for annotation of TE insertions without TSDs would be much appreciated.


6 Comments

  1. caseybergman

    For those who might be interested, RepeatMasker generates 1-based coordinates, based on the 1-based coordinate system provided by the underlying BLAST similarity search.

  2. I was thinking a lot about the real location of an insertion. I am not very enthusiastic about using a zero-based coordinate system since, in my opinion, it also introduces some confusion. I think that choosing only one nucleotide coordinate is better, even if the insertion actually occurs between two nucleotides. I consider that it is OK to symbolize the insertion site as the nucleotide having the highest number (genomic coordinate). Let us suppose that an insertion does not create a TSD and is located between nucleotides 100 and 101 in a reference genome. Then, by convention, 101 may be accepted as the single coordinate of insertion. The rationale would be that before nucleotide 100 (including its covalent bond with 99, since an oligonucleotide of size 100 contains just 99 phosphate bridges) there is no insertion in the genome (a clean situation), but after nucleotide 100 we definitely have a mutation, and after 100 is 101, but not zero. Anyway, it would be just a convention highlighting the so-called importance of nucleotide 1 of the genome and not of the last nucleotide.

    A different problem appears when one is sequencing both ends of a transposon insertion and finds out that, let us say, a P element insertion generates an octet duplication. In this situation, I consider that no software would be able to present an insertion coordinate represented by only one nucleotide, as long as you never know if the supplemental octet was created (inserted) at the left or at the right side of a transposon insertion. For example, if one is sequencing at both ends, our Genome ARTIST tool (please see the web link) offers two points of insertion for each of the two sequences, one being x and the other one x +/- 7, and both of them are right as long as the position of the duplicated octet is not absolute, but relative. Therefore, we consider that the right mapping of insertion for such a situation is dual (x and x +/- 7) and it may be taken as a convention. Let us assume that a P insertion duplicates the octet ACCCCCCT (we may regard the second octet as an insertion also), where the genomic coordinate of A is 101 and the genomic coordinate of T is 108. The insertion could be mapped both at A (coordinate 101, since A is part of the duplicated octet but also of the genome) and also at T (coordinate 108, since T is part of the duplicated octet but also of the genome). If, because of technical or financial reasons, one is sequencing only one end of such an insertion, Genome ARTIST would display either 101 or 108 as the coordinate of the insertion, depending on the actual end that was sequenced, since either of the two coordinates is equally possible.

    Due to such different situations, I am not sure that a unique coordinate system is close to being developed, but I do hope that my points of view are of some help to you.

    Best regards,
    Alex

  3. caseybergman

    See also E. W. Dijkstra’s notes on “Why numbering should start at zero”

    http://www.cs.utexas.edu/~EWD/ewd08xx/EWD831.PDF

  4. Mitch Skinner

    Just saw this on reddit; it’s nice to see people talking about it.

    At the time I was originally wrestling with this in JBrowse, I was testing with flybase data that had some features with negative coordinates. I’m not sure if they’re still there–I believe they were chromosome bands that had been mapped to before the beginning of the sequenced region.

    But once you have features with negative coordinates, you have to think about what happens at zero. If your coordinate system numbers bases (rather than spaces), then you have to decide where to put the zero, which is awkward. Or if you omit zero and go straight from -1 to 1 then the length calculation has to have a special case for features that span that transition. But handling zero is natural and consistent in interbase. Plus, as you say, it’s definitely simpler to describe insertions (and deletions) with interbase.

  5. caseybergman

    Hadn’t thought about the virtues of 0-based systems for going through the origin. And thanks for the heads-up about comments on reddit: http://www.reddit.com/r/bioinformatics/comments/rsp7t/on_genome_coordinate_systems_and_transposable/


