2.6 C. japonica genome repeat element, coding gene, and ncRNA annotation
C. japonica genome annotation was carried out from three
perspectives: repeat element identification, coding gene structure
prediction with functional annotation, and non-coding RNA (ncRNA)
prediction.
Repeat elements are an important part of the genome and include tandem
repeats and interspersed repeats (also known as transposable elements
[TEs]). In the present study, the Tandem Repeats Finder software
(Benson, 1999) was applied to find the tandem repeats in the C.
japonica genome sequences. Meanwhile, the RepeatMasker and
RepeatProteinMask software (http://www.repeatmasker.org) were executed
to annotate the interspersed repeats of the C. japonica genome
sequence based on the Repbase database (Jurka et al., 2005).
Furthermore, the RepeatMasker software (Bedell et al., 2000) was used to
compare the genome sequence against the repeat element database obtained
by the abovementioned methods to obtain a set of repeat elements. The
final set of C. japonica genome repeat elements was obtained by
merging the results of the three methods and removing redundant elements.
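The redundancy-removal step can be illustrated with a minimal sketch that merges overlapping or adjacent repeat intervals reported by several tools. The paper does not describe how redundancy was actually resolved, so the interval-merging rule, the function names, and the example coordinates below are all assumptions made for illustration.

```python
from collections import defaultdict


def merge_repeat_intervals(*annotation_sets):
    """Merge (seq_id, start, end) intervals from several tools.

    Coordinates are assumed to be 1-based and inclusive. Intervals that
    overlap or directly abut on the same sequence are collapsed into one.
    """
    by_seq = defaultdict(list)
    for annotations in annotation_sets:
        for seq_id, start, end in annotations:
            by_seq[seq_id].append((start, end))

    merged = {}
    for seq_id, intervals in by_seq.items():
        intervals.sort()
        result = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= result[-1][1] + 1:
                # Overlapping or adjacent: extend the current interval.
                result[-1][1] = max(result[-1][1], end)
            else:
                result.append([start, end])
        merged[seq_id] = [tuple(interval) for interval in result]
    return merged


def total_repeat_length(merged):
    """Total number of bases covered by the non-redundant repeat set."""
    return sum(end - start + 1
               for intervals in merged.values()
               for start, end in intervals)


# Hypothetical example records, one list per annotation method.
trf = [("scaffold1", 100, 200)]
repeatmasker = [("scaffold1", 150, 300), ("scaffold2", 10, 50)]
repeatproteinmask = [("scaffold1", 500, 600)]

merged = merge_repeat_intervals(trf, repeatmasker, repeatproteinmask)
print(merged)
print(total_repeat_length(merged))
```

Merging at the coordinate level ensures that a base annotated by more than one method is counted only once when the repeat content of the genome is summarized.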
Coding gene annotation includes structural prediction and functional
annotation. First, three prediction strategies (homology-based, ab
initio, and RNA-seq-based) were applied to predict the coding genes. In
the present study, Eriocheir sinensis, Penaeus monodon, Penaeus
vannamei, and P. trituberculatus were selected because
they are closely related to C. japonica, and the protein
sequences of these species were downloaded for the homology-based
structural prediction of C. japonica coding genes. Ab initio coding gene
prediction was
performed using Augustus (version 2.7) (Stanke et al., 2006) and the
GenScan software (Burge and Karlin, 1997) with default settings. The
filtered RNA-seq reads were mapped to the C. japonica genome
sequences using the TopHat software, and the Cufflinks software (Ghosh
and Chan) was then used to assemble transcripts and predict the coding
genes. The MAKER2 software (Carson and Mark, 2011) was used to integrate
the coding genes predicted by the abovementioned methods and remove
redundancy, and the HiCESAP process was used to obtain a more complete
and accurate coding gene dataset. Predicted coding genes were then
functionally annotated using InterPro
(Zdobnov and Apweiler, 2001), Gene Ontology (GO) (Ashburner et al.,
2000), Kyoto Encyclopedia of Genes and Genomes (KEGG_ALL) (Kanehisa and
Goto, 2000), KEGG Orthology (KEGG_KO) (Kanehisa and Goto, 2000),
Swiss-Prot (Bairoch and Apweiler, 2000), Translation of European
Molecular Biology Laboratory nucleotide sequence (TrEMBL) (Boeckmann et
al., 2003), TF, Pfam (Griffiths-Jones et al., 2005), NR, and Eukaryotic
Orthologous Groups (KOG) (Tatusov et al., 2003) databases to determine
the biological functions of the coding gene products and the metabolic
pathways in which they are involved.
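When several databases are used for functional annotation, a common summary is the number and percentage of predicted genes annotated in each database and in at least one of them. The following is a minimal sketch of that bookkeeping. The function name, the gene identifiers, and the per-database hit sets are hypothetical and do not come from the study.

```python
def summarize_annotation(all_genes, hits_by_database):
    """Count genes annotated per database and in at least one database.

    all_genes: iterable of predicted gene IDs.
    hits_by_database: dict mapping a database name to the gene IDs
        that received an annotation from that database.
    Returns a dict mapping each label to (count, percentage of all genes).
    """
    all_genes = set(all_genes)
    if not all_genes:
        raise ValueError("all_genes must not be empty")

    def as_row(genes):
        return (len(genes), 100.0 * len(genes) / len(all_genes))

    summary = {}
    annotated = set()
    for database, genes in hits_by_database.items():
        # Ignore IDs that are not part of the predicted gene set.
        found = set(genes) & all_genes
        annotated |= found
        summary[database] = as_row(found)

    summary["annotated_in_any"] = as_row(annotated)
    summary["unannotated"] = as_row(all_genes - annotated)
    return summary


# Hypothetical gene IDs and database hits, for illustration only.
genes = ["g1", "g2", "g3", "g4"]
hits = {
    "InterPro": ["g1", "g2"],
    "Swiss-Prot": ["g2", "g3"],
    "KOG": ["g2"],
}

summary = summarize_annotation(genes, hits)
for label, (count, percent) in summary.items():
    print(f"{label}\t{count}\t{percent:.1f}%")
```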
Non-coding RNAs, such as ribosomal RNA (rRNA), microRNA (miRNA), transfer
RNA (tRNA), and small nuclear RNA (snRNA), are RNAs that are not
translated into proteins but have important biological functions. Among
these, miRNA functions in gene silencing and can trigger degradation of
its target mRNA or inhibit its translation into protein. Both tRNA and
rRNA are directly involved in protein synthesis, whereas snRNA is
involved in the processing of RNA precursors and is a main component of
the spliceosome. The
tRNAscan-SE software (version 1.3.1) (Lowe and Eddy, 1997) was used
to search for tRNA sequences in the C. japonica genome according
to the structural characteristics of tRNA. Considering the high
conservation of rRNA, the BLASTN software (Altschul et al., 1990) was
used to search for rRNA in the C. japonica genome based on the rRNA
sequences of closely related species. Additionally, miRNA and snRNA
were predicted using the INFERNAL software (version 1.1) (Nawrocki,
2014).
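The homology-based rRNA search can be illustrated by a short sketch that filters BLASTN hits reported in the standard 12-column tabular format (`-outfmt 6`), with the rRNA sequences of related species as queries and the genome as the subject. The identity and E-value thresholds, the function name, and the example hit lines are assumptions for illustration; the study does not report the cutoffs that were used.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_rrna_hits(blast_tabular, min_identity=90.0, max_evalue=1e-5):
    """Return candidate rRNA loci as (seq_id, start, end, strand, query).

    min_identity and max_evalue are illustrative thresholds only.
    """
    hits = []
    reader = csv.reader(io.StringIO(blast_tabular), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        record = dict(zip(BLAST6_COLUMNS, row))
        if float(record["pident"]) < min_identity:
            continue
        if float(record["evalue"]) > max_evalue:
            continue
        sstart, send = int(record["sstart"]), int(record["send"])
        # BLASTN reports minus-strand hits with sstart greater than send.
        strand = "+" if sstart <= send else "-"
        hits.append((record["sseqid"], min(sstart, send),
                     max(sstart, send), strand, record["qseqid"]))
    return hits


# Hypothetical BLASTN output lines, for illustration only.
example = (
    "18S_rRNA\tscaffold3\t97.50\t400\t10\t0\t1\t400\t1000\t1399\t1e-150\t650\n"
    "28S_rRNA\tscaffold7\t82.00\t120\t20\t2\t5\t124\t900\t781\t1e-10\t90\n"
    "5.8S_rRNA\tscaffold3\t99.00\t150\t1\t0\t1\t150\t2150\t2001\t2e-70\t270\n"
)

rrna_hits = filter_rrna_hits(example)
for hit in rrna_hits:
    print(hit)
```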