Genome Annotation
To annotate both structural and functional properties of the
genome, we used GenSAS v6 (Humann et al. 2019). We first
identified and masked interspersed repeats and low complexity DNA
sequences in the assembly. RepeatModeler v1.0.11 was run to
identify and produce a structural annotation of repeat regions de
novo (Smit & Hubley 2008), and subsequently RepeatMasker
v4.1.0 was used to generate a modified version of the genome with these
regions masked (Smit et al. 2015). We then employed
Braker v2.1.0 (Hoff et al. 2019; Stanke et al. 2008; Stanke et al. 2006) to automatically predict gene models
for protein coding genes in the masked genome. The Braker
pipeline uses the tools GeneMark-ES/ET (Lomsadze et al. 2005; Ter-Hovhannisyan et al. 2008) and Augustus
(Camacho et al. 2009), as well as evidence from RNAseq data, to
predict gene models in novel eukaryotic genomes. Paired-end RNAseq reads
were aligned to the genome using Hisat2 (Kim et al. 2015) with default settings, and the alignment file was provided to
Braker. Finally, PASA v2.3.3 (Haas et al. 2008) was used
to refine the gene models, using the assembled transcriptome as input.
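
As an illustration of the RNAseq alignment step described above, the following minimal sketch assembles the command lines for indexing the genome, aligning paired-end reads with Hisat2 under default settings, and sorting the result for downstream use. It assumes a standalone command-line run rather than the GenSAS interface used here; the file names, the output prefix, and the samtools sorting step are assumptions for illustration and do not come from the original analysis. The commands are only constructed, not executed.

```python
def hisat2_commands(genome_fasta, reads_1, reads_2, out_prefix):
    """Return argv lists for indexing, aligning, and sorting.

    Nothing is executed here; each list could be passed to subprocess.run.
    All paths are hypothetical placeholders.
    """
    index_base = f"{out_prefix}_index"
    sam_path = f"{out_prefix}.sam"
    bam_path = f"{out_prefix}.sorted.bam"
    return [
        # Build the Hisat2 index from the genome assembly.
        ["hisat2-build", genome_fasta, index_base],
        # Align paired-end reads; no extra options, i.e. default settings.
        ["hisat2", "-x", index_base, "-1", reads_1, "-2", reads_2, "-S", sam_path],
        # Coordinate-sort the alignments into a BAM file (assumed extra step).
        ["samtools", "sort", "-o", bam_path, sam_path],
    ]


if __name__ == "__main__":
    for cmd in hisat2_commands("genome.fa", "reads_R1.fastq", "reads_R2.fastq", "rnaseq"):
        print(" ".join(cmd))
```
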
For the resulting consensus gene models, we assigned functional
annotations using a combination of six tools. Amino acid similarity to
proteins in the NCBI RefSeq invertebrate database was used for
functional annotation based on searches with both Diamond
v0.9.22 (Buchfink et al. 2015) and Blastp v2.7.1 (with
the settings: matrix = BLOSUM62, expect = 1e-8, word
size = 3, gap open penalty = 11, gap extend penalty = 1, maximum HSP
distance = 30000). The gene set was further annotated based on the
presence of peptide domains, using InterProScan v5.29-68.0,
Pfam v1.6 (with the settings: e-value sequence = 1 and e-value
domain = 10), and SignalP v4.1 (with the settings: organism
group = eukaryotes, method = best, D-cutoff for noTM networks = 0.45,
D-cutoff for TM networks = 0.50, minimal predicted peptide length = 10,
and truncate sequence length = 70). Finally, the Kyoto
Encyclopedia of Genes and Genomes (KEGG) orthology terms were assigned
to each gene using the KEGG Automatic Annotation Server (Moriya et al. 2007), based on bi-directional best hit searches of the
nucleotide sequences using BLAST.
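
The bi-directional best hit criterion mentioned above can be sketched as follows. This is a simplified illustration of the general idea, not the KEGG Automatic Annotation Server's implementation: it assumes each search result is reduced to (query, subject, bit score) tuples, ignores the server's score thresholds and ortholog-group assignment rules, and uses hypothetical identifiers. A gene and a reference entry are paired only when each is the other's top-scoring hit.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(forward_hits, reverse_hits):
    """Return {gene: reference} pairs that are mutual best hits.

    forward_hits: genes searched against the reference set.
    reverse_hits: reference entries searched against the gene set.
    """
    forward = best_hits(forward_hits)
    reverse = best_hits(reverse_hits)
    return {gene: ref for gene, ref in forward.items() if reverse.get(ref) == gene}


if __name__ == "__main__":
    # Hypothetical scores: geneB's best hit is ref2, but ref2's best hit is geneA.
    forward = [("geneA", "ref1", 250.0), ("geneA", "ref2", 180.0), ("geneB", "ref2", 90.0)]
    reverse = [("ref1", "geneA", 245.0), ("ref2", "geneA", 200.0), ("ref2", "geneB", 85.0)]
    print(bidirectional_best_hits(forward, reverse))
```

In this toy input only geneA and ref1 are retained, because geneB's top hit is not reciprocated.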