Gene Prediction and Annotation
A combination of de novo, homology- and transcript-based methods was used for gene prediction. A comprehensive transcriptome database was built with the PASA pipeline v2.1.0 (Haas et al., 2003). After quality filtering with Trimmomatic v0.33 (Bolger et al., 2014), Illumina RNA-seq reads were assembled de novo using Trinity v2.6.6 (Haas et al., 2013). Genome-guided transcripts were then generated using (1) the genome-guided mode implemented in Trinity and (2) the HISAT-StringTie pipeline v1.3.3b (Pertea et al., 2015). Homologs were predicted by mapping protein sequences from A. thaliana, Aethionema arabicum, Arabidopsis lyrata, B. rapa, Capsella rubella, Carica papaya, Eutrema salsugineum and Leavenworthia alabamica to the M. pygmaea genome using tblastn (E-value ≤ 1e−5), and exonerate v2.4.0 (Slater & Birney, 2005) was used for gene annotation. De novo gene prediction was performed with Augustus v3.2.3 (Stanke et al., 2004), with parameters trained on PASA-derived gene models, and with GlimmerHMM v3.0.4 (Majoros et al., 2004). Gene models from the three main sources (i.e., aligned transcripts, de novo predictions and aligned proteins) were merged into consensus models using EVidenceModeler v1.1.1 (Haas et al., 2008). Functional assignments for all genes were generated by alignment to public protein databases, including Swiss-Prot and TrEMBL (Bairoch & Apweiler, 2000). Protein domains were annotated by searching against InterPro (Zdobnov & Apweiler, 2001). Predicted gene functions and metabolic pathways were annotated using Blast2GO v2.5 (Conesa et al., 2005) together with the GO (Gene Ontology Consortium, 2004) and KEGG (Kanehisa et al., 2012) databases. We further extracted collinear paralogous genes and calculated synonymous substitution rates (Ks) to examine potential whole-genome duplication (WGD) events.
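As an illustration of the homology-search filter described above, the following minimal Python sketch keeps only tblastn hits with an E-value ≤ 1e−5. It assumes the hits were written in the standard 12-column BLAST+ tabular format; the output format, the function name filter_hits, and the example query and scaffold identifiers are assumptions made for illustration and are not taken from the original pipeline.

```python
import csv
import io

# E-value threshold stated in the text.
EVALUE_CUTOFF = 1e-5

# Default column order of BLAST+ tabular output (-outfmt 6).
# Assumption: the tblastn results were written in this format.
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits(handle, cutoff=EVALUE_CUTOFF):
    """Yield hits (as dicts) whose E-value is at or below the cutoff."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] <= cutoff:
            yield hit


# Hypothetical example records; identifiers and values are invented.
example = "\n".join([
    "query_prot_1\tscaffold_1\t85.2\t120\t15\t1\t1\t120\t3000\t3359\t2e-30\t110",
    "query_prot_1\tscaffold_7\t40.0\t50\t28\t2\t5\t54\t900\t1049\t0.003\t35.1",
    "query_prot_2\tscaffold_2\t70.0\t200\t55\t3\t1\t200\t600\t1\t1e-05\t150",
])

kept = list(filter_hits(io.StringIO(example)))
for hit in kept:
    print(hit["qseqid"], hit["sseqid"], hit["evalue"])
```

The same threshold can usually be applied at search time through the tblastn -evalue option, in which case a post hoc filter like this one mainly serves as a sanity check on the retained hits.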