2.5 Gene prediction and annotation
Based on the repeat-masked genome, we employed de novo,
homology-based and transcriptome-assisted predictions to detect the
protein-coding genes. De novo gene prediction was performed using
Augustus (v2.7) (Stanke et al., 2006) with the Danio rerio training set and default settings. For homology-based prediction,
protein sequences of Danio rerio, Takifugu rubripes, Gasterosteus
aculeatus, Epinephelus lanceolatus, Epinephelus akaara, Oryzias latipes and Cynoglossus semilaevis were downloaded from the NCBI database and
aligned to the P. leopardus genome using tBLASTn (E-value≤1e-5).
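The E-value cutoff applied to the tBLASTn hits can be illustrated with a minimal sketch. It assumes the hits were written in the default 12-column BLAST tabular layout (-outfmt 6), in which the E-value is the eleventh column; the text does not state the output format, and the function name and example rows below are hypothetical.

```python
import csv
import io

EVALUE_CUTOFF = 1e-5  # threshold stated in the text


def filter_hits(tabular_text, cutoff=EVALUE_CUTOFF):
    """Keep rows of BLAST tabular output whose E-value is <= cutoff.

    Assumes the default 12-column -outfmt 6 layout:
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    kept = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        # Skip blank lines and comment lines (as produced by -outfmt 7).
        if not row or row[0].startswith("#"):
            continue
        evalue = float(row[10])  # eleventh column holds the E-value
        if evalue <= cutoff:
            kept.append(row)
    return kept


# Hypothetical example rows, not real alignment results.
example = (
    "protA\tchr1\t85.0\t120\t18\t0\t1\t120\t1000\t1359\t2e-40\t150\n"
    "protB\tchr2\t40.0\t50\t30\t0\t1\t50\t500\t649\t0.003\t35\n"
)
hits = filter_hits(example)
print([h[0] for h in hits])
```

In practice the same cutoff can be passed directly to tBLASTn through its -evalue option, so a post hoc filter like this is only needed when hits were collected with a looser threshold.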
The homologous genome sequences were then aligned against the matching
proteins using GeneWise (v2.4.0) (Doerks et al., 2002) for accurate
spliced alignments. Transcriptomic data were generated from six RNA-Seq
libraries, one for each of six tissues: gonad, liver, skin, spleen,
muscle and fin. A total of 69.34 Gb of clean data
were aligned to the assembled genome sequences using HISAT2 (v2.0.10)
(Pertea et al., 2016) and the putative transcript structures were
detected using StringTie (v2.1.1) (Pertea et al., 2016). The candidate
protein-coding regions within transcript sequences were then predicted
with TransDecoder (v5.5.0)
(https://github.com/TransDecoder/TransDecoder/). Finally, genes
predicted from the above methods were merged into a consensus gene set
using Glean (Elsik et al., 2007).
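The transcriptome-assisted steps described above can be sketched as a sequence of command lines. The following is an illustrative outline only, not the authors' actual invocation: it assumes paired-end reads, a pre-built HISAT2 index, and samtools for BAM sorting, none of which are stated in the text. All file names, the thread count, and the helper function names are hypothetical. The commands are built and printed, not executed.

```python
TISSUES = ["gonad", "liver", "skin", "spleen", "muscle", "fin"]


def per_tissue_commands(index, tissue, threads=4):
    """Return the align / sort / assemble commands for one library."""
    reads1 = f"{tissue}_1.fq.gz"  # hypothetical file names
    reads2 = f"{tissue}_2.fq.gz"
    sam = f"{tissue}.sam"
    bam = f"{tissue}.sorted.bam"
    gtf = f"{tissue}.gtf"
    return [
        # --dta tailors the alignments for transcript assemblers
        ["hisat2", "--dta", "-p", str(threads), "-x", index,
         "-1", reads1, "-2", reads2, "-S", sam],
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        ["stringtie", bam, "-p", str(threads), "-o", gtf],
    ]


def orf_commands(transcripts_fasta):
    """Return the two TransDecoder steps for a transcript FASTA file."""
    return [
        ["TransDecoder.LongOrfs", "-t", transcripts_fasta],
        ["TransDecoder.Predict", "-t", transcripts_fasta],
    ]


commands = []
for tissue in TISSUES:
    commands.extend(per_tissue_commands("pleo_index", tissue))

# Combine the per-tissue assemblies into one transcript set (assumption:
# the text does not say whether the assemblies were merged).
merge = ["stringtie", "--merge", "-o", "merged.gtf"]
merge += [f"{t}.gtf" for t in TISSUES]
commands.append(merge)

# transcripts.fa is assumed to have been extracted from merged.gtf beforehand.
commands.extend(orf_commands("transcripts.fa"))

for cmd in commands:
    print(" ".join(cmd))
```

Keeping each command as a list of arguments means it could be passed to subprocess.run without shell quoting issues if the sketch were turned into a real driver script.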