De-mulitplexing, Quality Filtering, and Phylogeny Reconstruction
Samples were demultiplexed using the MiSeq Reporter Generate FASTQ workflow. Primer sequences were trimmed from forward and reverse sequence reads using cutadapt v.1.4.2 (Martin, 2011). Following de-multiplexing, samples were processed using the QIIME2 v.2018.2 pipeline (https://docs.qiime2.org/2018.2/). DADA2 was used to filter out PhiX reads and chimeras, truncate the length of reads (forward reads cut at 200bp, reverse reads cut at 180bp), and cluster reads into unique amplicon sequence variants (ASVs) corrected for Illumina sequencing errors (Callahan et al., 2016). Reads were aligned using the MAFFT plugin in QIIME2 (FFT-NS-i; Katoh et al., 2002, 2005; Katoh & Toh, 2007). Default parameters were used to mask highly variable regions of the alignment and reconstruct a phylogeny using the FastTree2 plugin (Price et al., 2010), which was midpoint-rooted. The GreenGenes Database, v.13.5, trimmed to only the 16S rRNA V4 region, was used as a reference to train a naïve Bayes q2-feature-classifier, which was used to taxonomically identify ASVs in our data (DeSantis et al., 2006).
Contamination is ubiquitous in microbiome studies and especially problematic for low biomass samples (Eisenhofer et al., 2019; Salter et al., 2014; S. Weiss et al., 2014). To reduce the impact of contaminants, several filtering steps were performed. First, any bacterial taxon detected in the negative controls was removed from all other samples, with the exception of Arsenophonus . This genus of bacteria contains known symbionts of insects (Nováková et al., 2009) and is expected to be associated with bat flies (Wilkinson et al., 2016). AsArsenophonus is highly abundant in the samples sequenced for this study, it may be that its detection in the extraction control (0.7% of 3,524 total reads) and PCR control (55% of 63 total reads) is due to index bleed, a known issue when multiplexing samples (Eisenhofer et al., 2019; Kircher et al., 2012; Mitra et al., 2015), and it is not treated as a contaminant here. Next, bacterial genera were removed that are known laboratory contaminants (Eisenhofer et al., 2019), as were reads that were classified as being derived from mitochondria, chloroplast, or Archaea, or those that could not be classified beyond phylum. Data were exported from QIIME2 and reformatted for import into the R packagephyloseq v.1.26.1 (McMurdie & Holmes, 2012, 2013) for further decontamination and all downstream analyses. We used the R packagedecontam to identify ASVs whose frequency is inversely correlated with initial library concentration (Davis et al., 2018). Nine additional ASVs were identified as potential contaminants and eliminated from the dataset. Arsenophonus was not identified as a contaminant bydecontam . Finally, the data were filtered by two coverage depths: 1) all ASVs present in a sample at <0.01% relative abundance were eliminated from that sample, and 2) all ASVs present at <0.1% relative abundance were eliminated from that sample. At a minimum relative abundance of 0.01%, spurious ASVs may remain in the dataset, but at a minimum relative abundance of 0.1%, rare ASVs may be incorrectly excluded (Alberdi et al., 2018; Bokulich et al., 2013). Analyses were performed on both datasets.
As low concentration libraries were used to dilute high concentration libraries prior to sequencing, sequencing effort across samples is not even. Low concentration samples may be more prone to contamination due to their low biomass or may not receive adequate sequencing depth to describe the diversity of the bacterial community. To assess the bacterial diversity captured by low concentration libraries compared to high concentration libraries, the ggrare function was used from the phyloseq-extended suite of tools, which wraps the functionrarefy from the package vegan v.2.5.4 while maintaining the phyloseq data structure (https://github.com/mahendra-mariadassou/phyloseq-extended/blob/master/R/graphical_methods.R ; (Oksanen et al., 2010).