Genome annotation
Repeat sequences, identified from the assembled T. polyphylla genome,
accounted for 60.10% of the genome (Table S10). Of these, simple
sequence repeats (SSRs) accounted for 0.18% of the repeat fraction,
including 44,893 di-, 7,943 tri-, and 856 tetra-nucleotide repeats
(Tables S10-S11). We also identified 34,470 tandem repeats totaling
2.39 Mb of sequence, accounting for 0.58% of the T. polyphylla genome
(Table S11). Overall, the combined results of the de novo and
homology-based methods revealed that 57.29% of the T. polyphylla
genome consisted of transposable elements (TEs), of which Class I
(retrotransposons) and Class II
(DNA transposons) comprised 49.57% and 7.71% of the genome,
respectively (Table S11). Of these, long terminal repeat (LTR)
retrotransposons constituted the predominant repeat element in the
genome, accounting for 45.02%. Further examination showed that two
types of LTRs, Gypsy and Copia, occupied 25.39% and 4.07% of the
genome sequences, respectively.
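As a quick consistency check on the percentages above, the following minimal sketch recomputes a few derived quantities. All input values are copied from the text; the variable names are illustrative, and the implied genome size is only a back-of-envelope estimate from the tandem-repeat figures, not a value reported in this section.

```python
# Percentages of the genome, copied from the text above.
te_total = 57.29   # all TEs (de novo + homology-based)
class_i = 49.57    # retrotransposons
class_ii = 7.71    # DNA transposons
ltr = 45.02        # LTR retrotransposons
gypsy = 25.39
copia = 4.07

# Class I + Class II should reproduce the TE total up to rounding.
class_sum = round(class_i + class_ii, 2)
rounding_gap = round(te_total - class_sum, 2)

# Share of the genome in LTR elements that the text does not
# break down further (neither Gypsy nor Copia).
other_ltr = round(ltr - gypsy - copia, 2)

# Assumption: if 2.39 Mb of tandem repeats is 0.58% of the genome,
# the assembly size is roughly 2.39 / 0.0058 Mb.
tandem_mb = 2.39
tandem_pct = 0.58
implied_genome_mb = tandem_mb / (tandem_pct / 100)

print(f"Class I + II: {class_sum}% (gap to total: {rounding_gap}%)")
print(f"LTR not assigned to Gypsy/Copia: {other_ltr}%")
print(f"Implied genome size: ~{implied_genome_mb:.0f} Mb")
```

The 0.01 percentage-point gap between the class sum and the reported TE total is consistent with rounding of the individual values.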
We identified 25,319 protein-coding genes in the T. polyphylla
genome (Table S12, Table S15), with average gene, coding sequence,
and exon lengths of 4192.8 bp, 1221.8 bp, and 227.8 bp, respectively,
and an average of 5.36 exons per gene (Fig. S3). In total, 23,041
genes were annotated in at least one of the five databases,
accounting for 91% of the total genes
(Fig. S4; Table S13). In addition to protein-coding
genes, various non-coding RNA sequences were identified and annotated
(Table S14), including 703 transfer RNAs, 607 ribosomal RNAs,
90 microRNAs, and 220 small nuclear RNAs.
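The gene statistics above can be cross-checked with simple arithmetic. The sketch below uses only numbers from the text; the variable names are illustrative, and dividing average coding sequence length by average exon length is an approximation that assumes the reported exon length refers to coding exons.

```python
# Gene counts copied from the text above.
total_genes = 25_319
annotated_genes = 23_041

# Fraction of genes annotated in at least one database.
annotated_pct = round(annotated_genes / total_genes * 100, 1)

# Assumption: average CDS length / average exon length approximates
# the average number of (coding) exons per gene.
avg_cds_bp = 1221.8
avg_exon_bp = 227.8
exons_per_gene = round(avg_cds_bp / avg_exon_bp, 2)

# Non-coding RNA counts by class, as listed in the text.
ncrna_counts = {"tRNA": 703, "rRNA": 607, "miRNA": 90, "snRNA": 220}
ncrna_total = sum(ncrna_counts.values())

print(f"Annotated genes: {annotated_pct}%")
print(f"Estimated exons per gene: {exons_per_gene}")
print(f"Total non-coding RNAs listed: {ncrna_total}")
```

Both derived values agree with the reported 91% annotation rate and 5.36 exons per gene.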