2.6 C. japonica genome repeat element, coding gene, and ncRNA annotation
The C. japonica genome was annotated in three respects: repeat identification, prediction of non-coding RNA (ncRNA) and gene structure, and functional annotation.
Repeat elements are an important part of the genome and comprise tandem repeats and interspersed repeats (also known as transposable elements [TEs]). In the present study, the Tandem Repeats Finder software (Benson, 1999) was used to identify tandem repeats in the C. japonica genome sequence. The RepeatMasker and RepeatProteinMask software (http://www.repeatmasker.org) were run to annotate the interspersed repeats of the C. japonica genome sequence based on the Repbase database (Jurka et al., 2005). In addition, the RepeatMasker software (Bedell et al., 2000) was used to compare the genome sequence against the repeat element database obtained by the abovementioned methods to obtain a further set of repeat elements. The final set of C. japonica genome repeat elements was obtained by removing the redundancy among the results of the three methods.
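The redundancy-removal step can be illustrated with a minimal sketch, assuming that each tool's output has already been reduced to (sequence, start, end) records with 1-based inclusive coordinates. The function names and the toy coordinates below are hypothetical and are not taken from the pipeline used in this study; the sketch only shows how overlapping calls from several tools collapse into a non-redundant set.

```python
from collections import defaultdict


def merge_repeat_intervals(annotations):
    """Merge overlapping or directly adjacent repeat intervals per sequence.

    `annotations` is an iterable of (seq_id, start, end) tuples with
    1-based inclusive coordinates, pooled from all annotation tools.
    """
    by_seq = defaultdict(list)
    for seq_id, start, end in annotations:
        # Normalise so that start <= end regardless of strand.
        by_seq[seq_id].append((min(start, end), max(start, end)))

    merged = {}
    for seq_id, intervals in by_seq.items():
        intervals.sort()
        result = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= result[-1][1] + 1:  # overlaps or abuts the previous interval
                result[-1][1] = max(result[-1][1], end)
            else:
                result.append([start, end])
        merged[seq_id] = [tuple(interval) for interval in result]
    return merged


def masked_bases(merged):
    """Total number of bases covered by the merged intervals."""
    return sum(end - start + 1
               for intervals in merged.values()
               for start, end in intervals)


# Hypothetical toy records standing in for the output of the three methods.
trf_hits = [("scaffold1", 100, 200)]
repeatmasker_hits = [("scaffold1", 150, 300), ("scaffold2", 10, 50)]
proteinmask_hits = [("scaffold1", 301, 400), ("scaffold1", 1000, 1100)]

nonredundant = merge_repeat_intervals(
    trf_hits + repeatmasker_hits + proteinmask_hits
)
```

Dividing `masked_bases(nonredundant)` by the assembly length would give the repeat fraction of the genome; treating abutting intervals as one element is a design choice of this sketch, not a documented behaviour of the tools cited above.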
Coding gene annotation includes structural prediction and functional annotation. First, three prediction strategies, namely homology-based, ab initio, and RNA-seq-based prediction, were applied to predict the coding genes. In the present study, Eriocheir sinensis, Penaeus monodon, Penaeus vannamei, and Portunus trituberculatus were selected because they are closely related to C. japonica, and the protein sequences of these species were downloaded for the homology-based structural prediction of C. japonica coding genes. Ab initio coding gene prediction was performed using Augustus (version 2.7) (Stanke et al., 2006) and the GenScan software (Burge and Karlin, 1997) with default settings. The filtered RNA-seq reads were mapped to the C. japonica genome sequence with the TopHat software for transcript assembly, and the Cufflinks software (Ghosh and Chan) was then used to predict the coding genes. The MAKER2 software (Carson and Mark, 2011) was used to remove the redundancy among the coding genes predicted by the abovementioned methods, and the HiCESAP process was used to obtain a more complete and accurate coding gene dataset. The predicted coding genes were then functionally annotated using the InterPro (Zdobnov and Apweiler, 2001), Gene Ontology (GO) (Ashburner et al., 2000), Kyoto Encyclopedia of Genes and Genomes (KEGG_ALL) (Kanehisa and Goto, 2000), KEGG Orthology (KEGG_KO) (Kanehisa and Goto, 2000), Swiss-Prot (Bairoch and Apweiler, 2000), Translation of European Molecular Biology Laboratory nucleotide sequence (TrEMBL) (Boeckmann et al., 2003), TF, Pfam (Griffiths-Jones et al., 2005), NR, and Eukaryotic Orthologous Groups (KOG) (Tatusov et al., 2003) databases to determine the biological functions of the coding gene products and the metabolic pathways in which they are involved.
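The bookkeeping behind a multi-database functional annotation can be sketched as follows, assuming that the hits from each database have already been reduced to sets of gene identifiers. The function name, gene identifiers, and hit sets are hypothetical and serve only as an illustration; they are not results of this study.

```python
def annotation_summary(gene_ids, hits_by_database):
    """Summarise functional annotation coverage.

    `gene_ids` is the full list of predicted coding genes and
    `hits_by_database` maps a database name to the collection of gene IDs
    that received at least one hit in that database.
    Returns (per-database counts, number annotated, number unannotated).
    """
    genes = set(gene_ids)
    # Count only hits that belong to the predicted gene set.
    per_database = {db: len(genes & set(hits))
                    for db, hits in hits_by_database.items()}
    # A gene counts as annotated if any database reports a hit for it.
    annotated = set().union(*hits_by_database.values()) & genes
    return per_database, len(annotated), len(genes) - len(annotated)


# Hypothetical toy input: four predicted genes and three databases.
predicted_genes = ["gene1", "gene2", "gene3", "gene4"]
hits = {
    "Swiss-Prot": {"gene1", "gene2"},
    "Pfam": {"gene2", "gene3"},
    "KEGG_KO": {"gene2"},
}

per_database, n_annotated, n_unannotated = annotation_summary(predicted_genes, hits)
```

Reporting the union across databases, rather than the per-database counts alone, shows how many predicted genes remain without any functional assignment.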
Non-coding RNAs, such as ribosomal RNA (rRNA), microRNA (miRNA), transfer RNA (tRNA), and small nuclear RNA (snRNA), are RNAs that are not translated into proteins but have important biological functions. MicroRNAs function in gene silencing and can promote degradation of the target mRNA or inhibit its translation into protein. Transfer RNA and rRNA are directly involved in protein synthesis. Small nuclear RNA is involved in the processing of RNA precursors and is the main component of RNA spliceosomes. The tRNAscan-SE software (version 1.3.1) (Lowe and Eddy, 1997) was used to search for tRNA sequences in the C. japonica genome according to the structural characteristics of tRNA. Given the high conservation of rRNA, the BLASTN software (Altschul et al., 1990) was used to search for rRNA in the C. japonica genome based on the rRNA sequences of closely related species. Additionally, miRNA and snRNA were predicted using the INFERNAL software (version 1.1) (Nawrocki, 2014).
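The homology-based rRNA search can be illustrated with a minimal sketch that filters BLASTN results in the standard 12-column tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), assuming that the reference rRNA sequences were used as queries against the genome. The identity, E-value, and length thresholds, as well as the sequence names, are hypothetical; the cut-offs actually applied are not stated here.

```python
def filter_rrna_hits(tabular_lines, min_identity=90.0, max_evalue=1e-5,
                     min_length=50):
    """Keep BLASTN tabular hits that pass the (assumed) thresholds.

    Returns a list of (genome_seq, start, end, strand, reference_rrna)
    tuples with start <= end.
    """
    loci = []
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a complete 12-column record
        query, subject = fields[0], fields[1]
        identity, length = float(fields[2]), int(fields[3])
        sstart, send = int(fields[8]), int(fields[9])
        evalue = float(fields[10])
        if (identity >= min_identity and evalue <= max_evalue
                and length >= min_length):
            # In BLAST tabular output, sstart > send marks a minus-strand hit.
            strand = "+" if sstart <= send else "-"
            loci.append((subject, min(sstart, send), max(sstart, send),
                         strand, query))
    return loci


# Hypothetical toy records; names and coordinates are illustrative only.
example_lines = [
    "18S_ref\tscaffold3\t97.50\t1800\t45\t0\t1\t1800\t5000\t6799\t0.0\t3000",
    "28S_ref\tscaffold3\t95.00\t400\t20\t0\t1\t400\t9400\t9001\t1e-150\t600",
    "5S_ref\tscaffold7\t82.00\t60\t11\t0\t1\t60\t100\t159\t2e-04\t40",
]

rrna_loci = filter_rrna_hits(example_lines)
```

In this toy input the third record is discarded because its identity falls below the assumed 90% cut-off, and the second is reported on the minus strand.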