2.5 Gene prediction and annotation
Using the repeat-masked genome, we combined de novo, homology-based and transcriptome-assisted predictions to identify protein-coding genes. De novo gene prediction was performed using Augustus (v2.7) (Stanke et al., 2006) with the Danio rerio training set and default settings. For homology-based prediction, protein sequences of Danio rerio, Takifugu rubripes, Gasterosteus aculeatus, Epinephelus lanceolatus, Epinephelus akaara, Oryzias latipes and Cynoglossus semilaevis were downloaded from the NCBI database and aligned to the P. leopardus genome using tBLASTn (E-value ≤ 1e-5). The homologous genome sequences were then aligned against the matching proteins using GeneWise (v2.4.0) (Doerks et al., 2002) to obtain accurate spliced alignments. Transcriptomic data were generated from six RNA-Seq libraries, one from each of six tissues (gonad, liver, skin, spleen, muscle and fin). A total of 69.34 Gb of clean data were aligned to the assembled genome sequences using HISAT2 (v2.0.10) (Pertea et al., 2016), and putative transcript structures were reconstructed using StringTie (v2.1.1) (Pertea et al., 2016). Candidate protein-coding regions within the transcript sequences were then predicted with TransDecoder (v5.5.0) (https://github.com/TransDecoder/TransDecoder/). Finally, the genes predicted by the three approaches were merged into a consensus gene set using Glean (Elsik et al., 2007).
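As an illustration of the E-value threshold applied in the homology-based step, the following minimal sketch filters tBLASTn hits at E-value ≤ 1e-5. It assumes the hits were written in the standard 12-column BLAST tabular format; the output format is not stated here, and the names `Hit`, `parse_hits` and `filter_hits` and the example records are hypothetical, not part of the original pipeline.

```python
from typing import Iterable, List, NamedTuple

E_VALUE_CUTOFF = 1e-5  # threshold stated in the text


class Hit(NamedTuple):
    """Subset of fields from one tabular BLAST record (hypothetical helper)."""
    query: str      # protein identifier
    subject: str    # genome scaffold or chromosome
    s_start: int    # hit start on the genome
    s_end: int      # hit end on the genome (smaller than s_start on the minus strand)
    evalue: float
    bitscore: float


def parse_hits(lines: Iterable[str]) -> List[Hit]:
    """Parse tab-separated records, assuming the standard 12-column layout."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], int(f[8]), int(f[9]), float(f[10]), float(f[11])))
    return hits


def filter_hits(hits: Iterable[Hit], cutoff: float = E_VALUE_CUTOFF) -> List[Hit]:
    """Keep hits whose E-value does not exceed the cutoff."""
    return [h for h in hits if h.evalue <= cutoff]


# Made-up records for illustration only; these are not data from the study.
example = [
    "prot1\tchr1\t85.0\t100\t15\t0\t1\t100\t1000\t1299\t1e-30\t200",
    "prot2\tchr1\t40.0\t50\t30\t0\t1\t50\t5000\t5149\t0.01\t35",
    "prot3\tchr2\t60.0\t80\t32\t0\t1\t80\t900\t661\t1e-5\t60",
]
kept = filter_hits(parse_hits(example))
print([h.query for h in kept])
```

In this sketch the genomic intervals of the retained hits would be the regions passed on for spliced alignment; how adjacent hits are merged or extended before that step is not specified in the text and is left out.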