Gene Prediction and Annotation
A combination of de novo, homology-based and transcript-based methods
was used for gene prediction. A comprehensive transcriptome database was
built with the PASA pipeline v2.1.0 (Haas et al., 2003). After quality
filtering with Trimmomatic v0.33 (Bolger et al., 2014), a de novo
assembly was performed on Illumina RNA-seq reads using Trinity v2.6.6
(Haas et al., 2013). Then, genome-guided transcripts were created using
(1) the genome-guided mode implemented in Trinity and (2) the
HISAT-StringTie pipeline v1.3.3b (Pertea et al., 2015). Homologs were
predicted by mapping protein sequences from A. thaliana,
Aethionema arabicum, Arabidopsis lyrata, B. rapa, Capsella
rubella, Carica papaya, Eutrema salsugineum and Leavenworthia
alabamica to the M. pygmaea genome using
tblastn (E-value ≤ 1e−5), and exonerate v2.4.0 was used for gene
annotation (Slater & Birney, 2005). De novo gene prediction
was performed with Augustus v3.2.3 (Stanke et al., 2004), with parameters
trained on PASA-derived gene models, and with GlimmerHMM
v3.0.4 (Majoros et al., 2004). Gene models from the three main sources
(i.e., aligned transcripts, de novo predictions and aligned
proteins) were merged to produce consensus models by EVidenceModeler
v1.1.1 (Haas et al., 2008). The functional assignments for all genes
were generated by alignment to public protein databases including
Swiss-Prot and TrEMBL (Bairoch & Apweiler, 2000). Protein domains were
annotated by searching against InterPro (Zdobnov & Apweiler, 2001).
Predicted gene functions and metabolic pathways were annotated using
Blast2GO v2.5 (Conesa et al., 2005) and the GO (Gene Ontology
Consortium, 2004) and KEGG (Kanehisa et al., 2012) databases. We further
extracted collinear
paralogous genes and calculated synonymous substitution rates (Ks) to
examine potential whole-genome duplication (WGD) events.
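
WGD events are commonly inferred from peaks in the Ks distribution of collinear paralog pairs. The following is a minimal sketch of that binning step, assuming the Ks values have already been estimated; the text does not state which tool or settings were used. The function names, the 0.1 bin width, the Ks cap of 3.0 and the toy values are all hypothetical choices made for illustration.

```python
from collections import Counter


def ks_histogram(ks_values, bin_width=0.1, ks_max=3.0):
    """Count Ks values per bin; keys are integer bin indices.

    Values <= 0 or > ks_max are dropped (an assumed cutoff; very
    large Ks estimates are often excluded as saturated).
    """
    counts = Counter()
    for ks in ks_values:
        if 0 < ks <= ks_max:
            counts[int(ks / bin_width)] += 1
    return dict(sorted(counts.items()))


def modal_bin(hist, bin_width=0.1):
    """Return (lower, upper) Ks bounds of the most populated bin."""
    idx = max(hist, key=hist.get)
    return round(idx * bin_width, 6), round((idx + 1) * bin_width, 6)


# Toy Ks values for collinear paralog pairs (made up, not from the study).
toy_ks = [0.05, 0.42, 0.45, 0.48, 1.25, 1.31, 4.70]
hist = ks_histogram(toy_ks)
peak = modal_bin(hist)
print(hist)  # bin index -> number of paralog pairs
print(peak)  # Ks range of the most populated bin
```

In practice a peak would be judged from many thousands of pairs, often after smoothing or mixture-model fitting, rather than from the single most populated bin.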