Genome annotation
Repeat sequences, which account for 60.10% of the genome, were identified from the assembled T. polyphylla genome sequence (Table S10). Of these, SSRs accounted for 0.18% of the repeat fraction, including 44,893 di-, 7,943 tri-, and 856 tetra-nucleotide repeats (Tables S10-S11). We also identified 34,470 tandem repeats spanning 2.39 Mb, or 0.58% of the T. polyphylla genome (Table S11). Combining the de novo and homology-based predictions, TEs made up 57.29% of the T. polyphylla genome, with Class I (retrotransposons) and Class II (DNA transposons) elements comprising 49.57% and 7.71% of the genome, respectively (Table S11). Long terminal repeat (LTR) retrotransposons were the predominant repeat class, accounting for 45.02% of the genome. Among these, the two LTR superfamilies Gypsy and Copia occupied 25.39% and 4.07% of the genome sequence, respectively.
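As a reading aid, the repeat fractions reported above can be cross-checked with simple arithmetic. The sketch below uses only the values quoted in the text; the variable names are illustrative, and the derived quantities (the LTR remainder and the genome size implied by the tandem-repeat figures) are rough estimates from rounded percentages, not values reported in the supplementary tables.

```python
# Values quoted in the text (percent of genome unless noted).
te_total = 57.29   # all transposable elements
class_i = 49.57    # retrotransposons
class_ii = 7.71    # DNA transposons
ltr = 45.02        # LTR retrotransposons
gypsy = 25.39
copia = 4.07

# Class I + Class II should reproduce the TE total up to rounding.
class_sum = round(class_i + class_ii, 2)
rounding_gap = round(te_total - class_sum, 2)

# Remainders derived by subtraction (assumption: the categories nest as
# Gypsy/Copia within LTR, and LTR within Class I).
other_ltr = round(ltr - gypsy - copia, 2)
non_ltr_class_i = round(class_i - ltr, 2)

# Genome size implied by the tandem-repeat figures: 2.39 Mb is 0.58%.
tandem_mb = 2.39
tandem_pct = 0.58
implied_genome_mb = tandem_mb / (tandem_pct / 100)

# Sum of the three SSR classes listed; the text says "including",
# so other motif lengths may exist and this is a lower bound.
ssr_listed = 44_893 + 7_943 + 856

print(f"Class I + II = {class_sum}% (gap to TE total: {rounding_gap}%)")
print(f"LTR not Gypsy/Copia = {other_ltr}%")
print(f"Class I outside LTR = {non_ltr_class_i}%")
print(f"Implied genome size = {implied_genome_mb:.0f} Mb")
print(f"SSRs listed = {ssr_listed}")
```

The 0.01 percentage-point gap between the Class I plus Class II sum and the TE total is consistent with rounding of the individual values.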
We identified 25,319 protein-coding genes in the T. polyphylla genome (Tables S12 and S15), with an average gene length of 4,192.8 bp, an average coding sequence length of 1,221.8 bp, an average exon length of 227.8 bp, and an average of 5.36 exons per gene (Fig. S3). In total, 23,041 genes (91% of all predicted genes) were annotated in at least one of the five databases (Fig. S4; Table S13). In addition to protein-coding genes, various non-coding RNAs were identified and annotated (Table S14), including 703 transfer RNAs, 607 ribosomal RNAs, 90 microRNAs, and 220 small nuclear RNAs.
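The gene-level statistics above can be checked for internal consistency in the same way. This is a minimal sketch using only numbers quoted in the text; the variable names are illustrative, and the comparison of exon count times exon length with the coding sequence length assumes that the reported exon statistics refer to coding exons.

```python
# Gene counts quoted in the text.
genes_total = 25_319
genes_annotated = 23_041

annotated_pct = 100 * genes_annotated / genes_total
genes_unannotated = genes_total - genes_annotated

# Average exons per gene times average exon length should be close to
# the average coding sequence length (assumption: coding exons only).
avg_exons = 5.36
avg_exon_len_bp = 227.8
avg_cds_len_bp = 1221.8
approx_cds_bp = avg_exons * avg_exon_len_bp

# Non-coding RNA classes listed in the text.
ncrna = {"tRNA": 703, "rRNA": 607, "miRNA": 90, "snRNA": 220}
ncrna_total = sum(ncrna.values())

print(f"Annotated: {annotated_pct:.1f}% ({genes_unannotated} genes without a hit)")
print(f"Exons x exon length = {approx_cds_bp:.1f} bp vs reported CDS {avg_cds_len_bp} bp")
print(f"ncRNAs listed = {ncrna_total}")
```

Under that assumption, the product of the two exon averages (about 1,221 bp) agrees with the reported mean coding sequence length to within 1 bp, and the annotated fraction rounds to the stated 91%.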