2.5 | Genome annotation
To predict protein-coding genes, three approaches were combined: de novo
gene prediction, homology-based prediction, and RNA-seq-based annotation. For
de novo prediction, Augustus (v3.2.3), Geneid (v1.4), Genscan (v1.0),
GlimmerHMM (v3.04), and SNAP (http://homepage.mac.com/iankorf/)
were applied. For homology-based prediction, the protein
sequences of twelve published plant genomes (A. tauschii, B. distachyon, T. aestivum, T. durum, T. dicoccoides, T. urartu, H. vulgare, O. sativa, S. cereale, Sorghum bicolor, Z. mays, and Arabidopsis thaliana) were aligned to the genome using TblastN
(v2.2.26; E-value ≤1e-5), and GeneWise (v2.4.1) (Birney et
al., 2004) was then used to predict gene structures. To optimize the genome
annotation, the RNA-seq reads were aligned to the genome using TopHat
(v2.0.11) (Trapnell et al., 2009), and the alignments were used as input
for Cufflinks (v2.2.1) (Trapnell et al., 2012). The non-redundant
reference gene set was generated by merging the genes predicted by the three
methods with EvidenceModeler (v1.1.1), using PASA (Program to Assemble
Spliced Alignments) terminal exon support and including masked
transposable elements as input to gene prediction (Haas et al., 2008).
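
As an illustration of the homology-search filter mentioned above (TblastN hits retained at E-value ≤1e-5), the following minimal Python sketch filters tabular alignment output. It assumes the common 12-column tab-delimited BLAST layout, in which the E-value is the eleventh field; the function name filter_hits and the example rows are hypothetical and are not taken from the pipeline used in this study.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-5  # threshold stated in the text


def filter_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Keep rows of 12-column tabular BLAST output with E-value <= cutoff.

    Assumption: column 1 is the query, column 2 the subject, and
    column 11 the E-value. Returns (query, subject, evalue) tuples.
    """
    kept = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        evalue = float(row[10])
        if evalue <= cutoff:
            kept.append((row[0], row[1], evalue))
    return kept


# Made-up example rows (placeholders, not real alignments).
example = (
    "protA\tchr1\t85.0\t200\t30\t0\t1\t200\t1000\t1600\t1e-50\t180\n"
    "protA\tchr2\t40.0\t50\t30\t0\t1\t50\t500\t650\t0.5\t30\n"
    "protB\tchr1\t60.0\t120\t48\t0\t1\t120\t9000\t9360\t1e-5\t70\n"
)
hits = filter_hits(example)
```

In this sketch the weak second hit (E-value 0.5) is discarded, while the hit exactly at the threshold is retained because the comparison is inclusive.
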
Gene functions were assigned according to the best match obtained by aligning
the protein sequences to the Swiss-Prot database (E-value
≤1e-5) (Bairoch and Apweiler, 2000). The motifs and
domains were annotated using InterProScan (v5.31) by searching against
publicly available databases, including ProDom, PRINTS, Pfam, SMART,
PANTHER, and PROSITE (Mulder & Apweiler, 2008; Finn et al., 2014, 2015,
2017). The Gene Ontology (GO) IDs for each gene were assigned according
to the corresponding InterPro entry.
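
The two functional-annotation rules described above (best Swiss-Prot match per gene, and GO IDs transferred through InterPro entries) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: "best match" is taken to mean the hit with the lowest E-value, the InterPro-to-GO mapping is supplied as a plain dictionary, and all function names and identifiers (best_matches, assign_go, SP_entry_A, IPR_x, GO_a, and so on) are placeholders rather than real accessions or the scripts used in this study.

```python
# Illustrative sketch only; all identifiers in the example data are placeholders.

def best_matches(hits, cutoff=1e-5):
    """Return {gene: subject}, keeping per gene the hit with the lowest E-value.

    hits is an iterable of (gene_id, subject_id, evalue) tuples; hits with an
    E-value above the cutoff are ignored.
    """
    best = {}
    for gene, subject, evalue in hits:
        if evalue > cutoff:
            continue
        if gene not in best or evalue < best[gene][1]:
            best[gene] = (subject, evalue)
    return {gene: subject for gene, (subject, _evalue) in best.items()}


def assign_go(gene_to_interpro, interpro_to_go):
    """Transfer GO IDs to each gene from the InterPro entries it matches."""
    assigned = {}
    for gene, entries in gene_to_interpro.items():
        terms = set()
        for entry in entries:
            # Entries without a GO mapping contribute nothing.
            terms.update(interpro_to_go.get(entry, ()))
        assigned[gene] = sorted(terms)
    return assigned


swissprot_hits = [
    ("gene1", "SP_entry_A", 1e-30),
    ("gene1", "SP_entry_B", 1e-80),
    ("gene2", "SP_entry_C", 0.01),  # above the cutoff, so gene2 gets no function
]
functions = best_matches(swissprot_hits)

gene_to_interpro = {"gene1": ["IPR_x", "IPR_y"], "gene2": ["IPR_z"]}
interpro_to_go = {"IPR_x": ["GO_a"], "IPR_y": ["GO_a", "GO_b"]}
go_ids = assign_go(gene_to_interpro, interpro_to_go)
```

Collecting GO IDs in a set before sorting avoids duplicates when several InterPro entries of the same gene map to the same term.
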