3.4 Gene prediction and annotation
Combining the results from de novo, homology-based and transcriptome-assisted predictions, we generated a non-redundant gene set comprising 25,248 protein-coding genes (Table 5). The statistics of the predicted gene models were compared with those of other teleost species, including D. rerio, O. latipes and T. rubripes, and showed similar distributions of mRNA length, CDS length, exon length, intron length and exon number (Figure 3).
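The per-gene statistics compared above can be derived from exon coordinates alone. The following is a minimal sketch, not the pipeline used in this study: the function name `gene_model_stats` and the input layout (a list of 1-based inclusive exon intervals for one transcript) are assumptions made for illustration, and the transcript span is taken here as the genomic distance from the first exon start to the last exon end.

```python
def gene_model_stats(exons):
    """Summarise one transcript from its exon intervals.

    exons: list of (start, end) tuples, 1-based inclusive, on one strand.
    This input layout is an assumption for illustration.
    """
    exons = sorted(exons)
    # Length of each exon in bases (inclusive coordinates).
    exon_lengths = [end - start + 1 for start, end in exons]
    # An intron is the gap between the end of one exon and the start of the next.
    intron_lengths = [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]
    return {
        "exon_number": len(exons),
        "exon_lengths": exon_lengths,
        "intron_lengths": intron_lengths,
        # Genomic span from first exon start to last exon end.
        "transcript_span": exons[-1][1] - exons[0][0] + 1,
    }


# Toy transcript with three exons (coordinates are made up).
stats = gene_model_stats([(101, 200), (301, 350), (501, 600)])
print(stats)
```

Collecting these values over all gene models in each species would give the distributions that are compared in a figure such as Figure 3.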
We annotated the predicted genes by comparing their protein sequences against several public databases, including SwissProt, KEGG and TrEMBL, using BLASTp (E-value ≤ 1e-5). As a result, 92.3%, 84.6% and 96.4% of the predicted genes had hits in the SwissProt, KEGG and TrEMBL databases, respectively. We also employed InterProScan (v5.0) (Jones et al., 2014) to identify protein domains against the member databases of InterPro (ProDom, HAMAP, PANTHER, TIGRFAMs, PRINTS, PIRSF, Gene3D, COILS, PROSITE, Pfam, SMART) (Mitchell et al., 2019) and to assign Gene Ontology (GO) terms; 88.9% and 70.3% of the predicted genes were annotated in the InterPro and GO databases, respectively. In total, 24,364 genes (96.48% of all predicted genes) were functionally annotated in at least one of these databases (Supplementary Table S2).
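The annotation rates reported above amount to two operations: filtering hits by E-value, and taking the union of annotated genes across databases. The sketch below illustrates both under stated assumptions; it is not the authors' script. It assumes BLAST tabular output with the default 12 columns (query ID in column 1, E-value in column 11), and the function names `genes_with_hits` and `annotation_summary` are hypothetical.

```python
import csv
import io


def genes_with_hits(blast_tsv, max_evalue=1e-5):
    """Return the set of query IDs with at least one hit at or below max_evalue.

    blast_tsv: an open text handle over BLAST tabular output, assumed to use
    the default column order (qseqid first, evalue eleventh).
    """
    hits = set()
    for row in csv.reader(blast_tsv, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if float(row[10]) <= max_evalue:
            hits.add(row[0])
    return hits


def annotation_summary(total_genes, hits_by_db):
    """Per-database annotation rate (%) plus the union across all databases."""
    per_db = {db: 100.0 * len(ids) / total_genes for db, ids in hits_by_db.items()}
    annotated_any = set().union(*hits_by_db.values())
    return per_db, len(annotated_any), 100.0 * len(annotated_any) / total_genes


# Toy input: two queries, only the first passes the E-value cutoff.
toy = io.StringIO(
    "g1\tP1\t90\t100\t0\t0\t1\t100\t1\t100\t1e-30\t200\n"
    "g2\tP2\t30\t50\t0\t0\t1\t50\t1\t50\t0.5\t20\n"
)
print(genes_with_hits(toy))
```

Passing one hit set per database to `annotation_summary`, together with the total gene count, would yield the per-database percentages and the "at least one database" figure in the same form as reported here.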
For non-coding genes, 843 tRNAs were identified using tRNAscan-SE (Chan & Lowe, 2019). A total of 1,230 rRNA genes and 324 microRNAs were identified by homology searches against human rRNA sequences and the miRBase database (Kozomara & Griffiths-Jones, 2014), respectively. Small nuclear RNAs were annotated with Infernal (Nawrocki & Eddy, 2013) (http://infernal.janelia.org/) using the Rfam database (Kalvari et al., 2018) (Supplementary Table S3).