3.4 Gene prediction and annotation
Combining the results from de novo, homology-based and
transcriptome-assisted predictions, we generated a
non-redundant gene set comprising 25,248 protein-coding genes
(Table 5). The statistics of the predicted gene models were
compared with those of other teleost species, including D. rerio, O.
latipes and T. rubripes, and showed similar distribution patterns
in mRNA length, CDS length, exon length, intron length and exon number
(Figure 3).
We annotated the predicted genes by searching their protein sequences
against several public databases, including SwissProt, KEGG and TrEMBL,
using BLASTp (E-value ≤ 1e-5). As a result, 92.3%, 84.6% and 96.4% of
the predicted genes had hits in the SwissProt, KEGG and TrEMBL
databases, respectively. We also employed InterProScan (v5.0) (Jones et
al., 2014) to identify protein domains across the InterPro member
databases (ProDom, HAMAP, PANTHER, TIGRFAMs, PRINTS, PIRSF,
Gene3D, COILS, PROSITE, Pfam, SMART) (Mitchell et al., 2019) and to
assign Gene Ontology (GO) terms; 88.9% and 70.3% of the predicted genes
were annotated in the InterPro and GO databases, respectively. In total,
24,364 genes (96.48% of all predicted genes) were functionally
annotated in at least one of these databases
(Supplementary Table S2).
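The per-database and combined annotation rates reported above can be tallied along the following lines. This is a minimal sketch, not the pipeline used in this study: it assumes the BLASTp results were written in the standard 12-column tabular format (`-outfmt 6`), in which the query ID is the first column and the E-value the eleventh. The function names `genes_with_hits` and `annotation_summary` and the gene IDs in the usage note are hypothetical.

```python
from typing import Dict, Iterable, Set

EVALUE_CUTOFF = 1e-5  # threshold stated in the text


def genes_with_hits(blast_lines: Iterable[str],
                    cutoff: float = EVALUE_CUTOFF) -> Set[str]:
    """Return query IDs that have at least one hit with E-value <= cutoff.

    Assumes 12-column BLAST tabular output (-outfmt 6):
    column 1 is the query ID, column 11 is the E-value.
    """
    hits: Set[str] = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= cutoff:
            hits.add(query_id)
    return hits


def annotation_summary(total_genes: int,
                       hits_by_db: Dict[str, Set[str]]) -> Dict[str, float]:
    """Percentage of genes annotated per database and in at least one."""
    summary = {db: 100.0 * len(ids) / total_genes
               for db, ids in hits_by_db.items()}
    # A gene counts as annotated if it appears in any database's hit set.
    annotated_any: Set[str] = set()
    for ids in hits_by_db.values():
        annotated_any |= ids
    summary["any"] = 100.0 * len(annotated_any) / total_genes
    return summary
```

For example, with four hypothetical genes, `annotation_summary(4, {"SwissProt": {"g1", "g2"}, "KEGG": {"g2", "g3"}})` gives 50% for each database and 75% for "any", because the combined figure is based on the union of the hit sets rather than the sum of the individual rates.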
For non-coding genes, 843 tRNAs were identified using tRNAscan-SE (Chan
& Lowe, 2019). A total of 1,230 rRNA genes and 324 microRNAs were
identified by homology searches against the human rRNA sequence and the
miRBase database (Kozomara & Griffiths-Jones, 2014), respectively. Small
nuclear RNAs were annotated with Infernal (Nawrocki & Eddy, 2013)
(http://infernal.janelia.org/) using the Rfam database (Kalvari et al.,
2018) (Supplementary Table S3).