2.5 | Genome annotation
To predict protein-coding genes, three approaches were used: de novo gene prediction, homology-based prediction, and RNA-seq-based annotation. For de novo prediction, Augustus (v3.2.3), Geneid (v1.4), Genscan (v1.0), GlimmerHMM (v3.04), and SNAP (http://homepage.mac.com/iankorf/) were applied. For homology-based prediction, the protein sequences of twelve published plant genomes (A. tauschii, B. distachyon, T. aestivum, T. durum, T. dicoccoides, T. urartu, H. vulgare, O. sativa, S. cereale, Sorghum bicolor, Z. mays, and Arabidopsis thaliana) were aligned to the genome using TblastN (v2.2.26; E-value ≤1e-5), and GeneWise (v2.4.1) (Birney et al., 2004) was then used to predict gene structures from these alignments. To refine the annotation with transcript evidence, the RNA-seq reads were aligned to the genome using TopHat (v2.0.11) (Trapnell et al., 2009), and the alignments were used as input for Cufflinks (v2.2.1) (Trapnell et al., 2012). The non-redundant reference gene set was generated by merging the genes predicted by the three methods with EVidenceModeler (v1.1.1), using PASA (Program to Assemble Spliced Alignments) terminal exon support and including masked transposable elements as input to gene prediction (Haas et al., 2008).
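The E-value cutoff applied to the TblastN alignments can be illustrated with a minimal Python sketch. It assumes the alignments were exported in the standard 12-column tab-separated BLAST format (E-value in column 11, bit score in column 12); the function name `filter_tblastn_hits` and the `Hit` record are hypothetical and are not part of the pipeline described here.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One alignment row (hypothetical record type for illustration)."""
    query: str      # protein identifier
    subject: str    # genomic sequence identifier
    evalue: float
    bitscore: float


def filter_tblastn_hits(lines: Iterable[str], max_evalue: float = 1e-5) -> List[Hit]:
    """Keep alignments whose E-value is at or below max_evalue.

    Assumes 12 tab-separated columns per row, with the E-value in
    column 11 and the bit score in column 12.
    """
    hits: List[Hit] = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # ignore malformed rows
        evalue = float(fields[10])
        bitscore = float(fields[11])
        if evalue <= max_evalue:
            hits.append(Hit(fields[0], fields[1], evalue, bitscore))
    return hits
```

The retained hits would then define the candidate regions passed to a gene-structure predictor such as GeneWise; that step is not shown.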
Gene functions were assigned according to the best match obtained by aligning the protein sequences to the Swiss-Prot database (E-value ≤1e-5) (Bairoch and Apweiler, 2000). Motifs and domains were annotated using InterProScan (v5.31) by searching against publicly available databases, including ProDom, PRINTS, Pfam, SMART, PANTHER, and PROSITE (Mulder & Apweiler, 2008; Finn et al., 2014, 2015, 2017). Gene Ontology (GO) IDs for each gene were assigned according to the corresponding InterPro entry.
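The two assignment rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure actually run: "best match" is taken here to mean the lowest E-value, with ties broken by the higher bit score, and the InterPro-to-GO mapping is assumed to be available as a simple dictionary. All function names and identifiers are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# (gene_id, swissprot_accession, evalue, bitscore); hypothetical layout
SwissProtHit = Tuple[str, str, float, float]


def best_swissprot_hits(hits: Iterable[SwissProtHit],
                        max_evalue: float = 1e-5) -> Dict[str, str]:
    """Return the best Swiss-Prot accession per gene.

    Assumption: "best" means lowest E-value, ties broken by the
    higher bit score. Hits above max_evalue are discarded.
    """
    best: Dict[str, Tuple[Tuple[float, float], str]] = {}
    for gene, accession, evalue, bitscore in hits:
        if evalue > max_evalue:
            continue
        key = (evalue, -bitscore)  # smaller key means better hit
        if gene not in best or key < best[gene][0]:
            best[gene] = (key, accession)
    return {gene: accession for gene, (_, accession) in best.items()}


def go_terms_from_interpro(gene_to_interpro: Dict[str, List[str]],
                           interpro_to_go: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Transfer GO IDs to each gene through its InterPro entries."""
    result: Dict[str, List[str]] = {}
    for gene, entries in gene_to_interpro.items():
        terms = set()
        for entry in entries:
            # Entries without a GO mapping contribute nothing.
            terms.update(interpro_to_go.get(entry, ()))
        result[gene] = sorted(terms)
    return result
```

Genes with no hit below the threshold are absent from the first result, and genes whose InterPro entries carry no GO mapping receive an empty list in the second.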