Genome Assembly, Decontamination and Genome Assessment
To assemble the N. riversi genome, we first generated a long-read
genome assembly using the software CANU v2.0 (Koren et al. 2017).
Default parameters were used, except that the estimated genome size was
set to 230 Mb and the lowCovSpan and lowCovDepth parameters were set to
0.5 and 0, respectively. To correct errors inherited from the long reads, we polished the genome
by first mapping the Illumina short reads to the draft genome using
Minimap2 (Li 2018); the mapped reads (SAM files) and raw
reads (FASTQ files) were then used to polish the draft genome using
Racon (Vaser et al. 2017). This draft long-read assembly
was used as a benchmark to assess our final genome assembly, as well as
to identify non-target sequences. To provide a quantitative assessment
of the draft genome completeness, we used BUSCO with the Endopterygota
reference gene set. We then used two methods to identify possible
non-target sequences in the assembly: BLOBtools v1 (Laetsch &
Blaxter 2017), which uses GC content and short-read coverage to help
identify foreign DNA, and the sendsketch tool in BBMap
v38.86 (Bushnell 2014), which uses a k-mer based approach to match
sequences to reference databases. For sendsketch, the number of
sequences was set to 100k and the sketch length was set to 200k. We
identified the most abundant microbial taxa (Spiroplasma and
Acinetobacter) and used their GenBank reference
sequences, as well as the N. brevicollis mitochondrial genome, to
filter the long-read data using Minimap2 (Li 2018). The
short-read genomic data were also screened with these reference sequences
using the bbduk tool in BBMap, and then normalized
using bbnorm. With these refined read sets, we generated a
final assembly using the hybrid assembler Haslr (Haghshenas
et al. 2020). Runs of Haslr were conducted at different
settings of long-read coverage (10x, 20x and 25x), and each assembly was
then evaluated using BUSCO with the Endopterygota reference gene set. As
a final assembly step, we employed paired-end RNAseq data to join contigs
using P_RNA_scaffolder (Zhu et al. 2018). BUSCO was
run again on this assembly, and we compared the final RNA-scaffolded
Haslr 20x assembly to the Tribolium castaneum v5.2
RefSeq assembly (GCF_000002335.3) using QUAST v5.1.0rc1
(Gurevich et al. 2013). The genome assembly has been made
publicly available on NCBI (WGS Accession: JADQWA010000000).
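
The short-read polishing step described above (mapping with Minimap2, then polishing with Racon) can be sketched as two command lines assembled in Python. This is a minimal sketch under stated assumptions, not the invocation actually used: all file names are hypothetical, the `-ax sr` short-read preset for Minimap2 is an assumption because the text does not state which preset was chosen, and Racon is shown in its basic positional form (reads, overlaps, target) with default settings.

```python
import shlex


def polishing_commands(draft_fasta, short_reads_fastq, sam_path, polished_fasta):
    """Return two shell command strings for one round of short-read polishing.

    All paths are hypothetical placeholders supplied by the caller.
    """
    # Map the Illumina reads to the draft assembly. -a requests SAM output;
    # -x sr is Minimap2's short-read preset (assumed, not stated in the text).
    map_cmd = ["minimap2", "-ax", "sr", draft_fasta, short_reads_fastq]
    # Racon takes the reads, the overlaps (SAM), and the target sequences,
    # in that order, and writes polished sequences to standard output.
    polish_cmd = ["racon", short_reads_fastq, sam_path, draft_fasta]
    return [
        f"{shlex.join(map_cmd)} > {shlex.quote(sam_path)}",
        f"{shlex.join(polish_cmd)} > {shlex.quote(polished_fasta)}",
    ]


# Print the two commands for a hypothetical set of file names.
for command in polishing_commands(
    "draft.fasta", "illumina.fastq", "mapped.sam", "polished.fasta"
):
    print(command)
```

Building the commands as argument lists and quoting them with `shlex` keeps paths containing spaces intact if the strings are later passed to a shell.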
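
Because the contamination screen relies in part on GC content, a short illustration of how a per-contig GC fraction can be computed may help. The sketch below is not the BLOBtools implementation; the function names and the example sequences are hypothetical, and ambiguous bases such as N are simply excluded from the denominator, which is one possible convention among several.

```python
def gc_fraction(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Sequences with no unambiguous bases are reported as 0.0.
    return gc / total if total else 0.0


def gc_by_contig(fasta_text):
    """Parse FASTA-formatted text and return {contig_name: GC fraction}."""
    result = {}
    name, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                result[name] = gc_fraction("".join(chunks))
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        result[name] = gc_fraction("".join(chunks))
    return result


# Hypothetical two-contig example.
example = ">contig1 demo\nATGC\nGGCC\n>contig2\nATATNN\n"
for contig, gc in gc_by_contig(example).items():
    print(contig, round(gc, 3))
```

Contigs whose GC fraction and read coverage both depart from the bulk of the assembly are the ones that would typically be inspected as candidate non-target sequences.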