Bioinformatics data analysis
Quality-filtering
The raw output of sequencing instruments is converted to the FASTQ
format, which is compatible with all major quality-filtering tools. As
most bioinformatics platforms have been developed for bacterial 16S
data, these differ greatly in their capacity of handling fungal ITS
sequences that typically cannot be reliably aligned much beyond the
genus level (Anslan et al., 2018). Based on citations, the most broadly
used platforms include QIIME2, mothur, PIPITS, SEED2, SCATA and
PipeCraft. Features as well as pros and cons of the most popular and
recently developed platforms are presented in Table 1.
For Illumina and MGI-Tech instruments that produce paired-end reads, it
is recommended to assemble the paired-end reads, unless the amplicon is
longer than the paired reads combined (e.g., Bissett et al., 2016).
Generally it is advisable to disregard unpaired reads because of the
risk of low read quality in their distal end.
When it comes to quality-filtering, a major step is simultaneous
demultiplexing, trimming primer and index sequences and removal of
low-quality and non-target reads. Any ambiguous nucleotides and
mismatches to indexes or primers are indicative of potentially low read
quality and these reads could be excluded. Dual-indexed reads with
mismatching pairs or a missing index from one primer are indicative of
index switching and incomplete sequence data, respectively. The expected
length of full-length ITS ranges from 250 (someSaccharomycetales ) to around 1,500 bases (e.g., someCantharellales and various unicellular groups), butMicrosporidea may have only a few bases of rudimentary ITS
sequences. The ITS1 and ITS2 subregions taken separately vary from 50 to
around 1000 bases. There is also great length variation in 18S and 28S
rRNA genes, which is mostly ascribed to introns.
Some 16S-based workflows recommend removal of homopolymers
>6 bases (e.g., default in QIIME2), but the ITS region of
many fungal and other eukaryotic taxa commonly harbour homopolymers
exceeding 10 bases. Hence, homopolymer length should not be used as an
indicator in quality-filtering for ITS sequences and other non-coding
regions.
For ITS metabarcoding, it is important to remove flanking 18S and 28S
rRNA genes, because these conserved ends display no species-level
resolution and random errors in these regions complicate clustering
(Lindahl et al., 2013). Furthermore, chimeric breakpoints may be common
in these regions but are nearly impossible to recognize from such short
fragments (e.g., 10-70 bases). ITS extraction can be performed using
programs cutadapt (Martin et al., 2011), ITSx (Bengtsson-Palme et al.,
2013) or ITSxpress (Rivers et al., 2018). Cutadapt is relatively more
universal, as it recognizes custom oligonucleotides and cuts from a
user-defined distance. ITSx and ITSxpress cut out ITS1, ITS2 and
full-length ITS region based on kingdom-wide Hidden Markov Models
(HMMs).
Chimeric molecules are mainly generated in the excessive cycles of PCR
and are therefore nearly always less abundant than their parent
molecules (Sze & Schloss, 2019). They are usually represented by
singletons and doubletons restricted to a single sample. There are
multiple algorithms for chimera detection, of which UCHIME (Edgar et
al., 2011) is by far the most universal and widely used. It is
recommended to perform chimera filtering both in de novo and
reference-based modes, which compare OTUs against each other (in ranked
abundance within a sample) and against a reference database (e.g.,
UNITE), respectively. According to our experience, a vast majority of
reference-based chimeras are true chimeras, whereas ca. half of thede novo chimeras may be false positives, as suggested for mock
communities (Aas et al., 2017). Not all chimeras are detected by the
programs, so it may be advisable to remove all singletons or OTUs with
<5 or <10 reads in the case of deep sequencing.
Chimeric molecules usually have only a partial match to the reference
sequence (coverage 55-98%) at high sequence similarity
(>90%), which allows additional algorithm-based removal or
rare, potentially artifactual OTUs.
Index-switches (also known as tag-switches, index-jumping and index
cross-talk) is the most deleterious phenomenon in HTS data, because it
results in technical cross-contamination among samples and may blur
especially patterns in host specificity, taxon networks and
biogeographical pattern (Carlsen et al., 2012). Index switches occur
during PCR, T4 blunt ending and cross-pairing of amplicons from
different libraries, and they are known from all sequencing platforms
(Schnell et al., 2005; Carøe & Bohmann, 2020). Careful indexing of
samples ameliorates this issue, but roughly 0.01-0.1% of obvious
switches will remain. Index-switches can be assessed with an ad
hoc score using the UNCROSS algorithm (Edgar, 2017), the unspread
python script (Larsson et al., 2018), by tracking non-biological
spike-ins (Palmer et al., 2018), or by a positive control sample. Based
on the distribution of spike-ins or positive control in biological
samples in vice versa, index-switch rates can be quantified. Sequence
abundances above the index-switching threshold are converted to zero.
Index-switches and other contamination should be checked for each
sequencing library separately. It is also useful to estimate occurrence
of taxa in pseudosamples represented by unused indexes (see Taberlet et
al., 2018) to assess the proportion of index-switch artefacts.
Clustering
Clustering is used for aggregating reads into OTUs based on user-defined
sequence similarity thresholds, usually ranging from 95-100% sequence
similarity. There are multiple clustering algorithms, many of which are
based on a global alignment or expect equal read length, and are thus
unsuited for ITS data. Most algorithms enforce strict sequence
similarity and coverage thresholds, but some allow relaxed similarity
(e.g., swarm; Mahe et al., 2021) and do not consider end-gaps (indels in
the end; e.g., SCATA, https://scata.mykopat.slu.se). Some algorithms
allow truncation of homopolymers (e.g., the algorithms implemented in
SCATA and PipeCraft), which may be warranted for the PacBio, ONT and Ion
Torrent platforms where indels in homopolymers are the most common
sequencing errors.
Three types of clustering are commonly seen in metabarcoding efforts:
open-reference, closed-reference and de novo clustering. Inde novo clustering, the sequence data are clustered within the
project. For closed-reference clustering and open-reference clustering,
a reference database is used, but only in the latter method, unique
sequences are retained. Open-reference greedy clustering is recommended
as it provides the most stable OTUs by accounting for the taxonomic
structure in the reference data (He et al., 2015; Cline et al., 2017).
Following studies in bacterial ecology and Sanger sequencing-based
mycorrhizal ecology, a sequence similarity threshold of 97% is most
commonly used, especially for short reads. Other studies have found that
98% or 98.5% sequence similarity is a better compromise between
sequencing errors and biological differences among closely related
species and intraspecific variation (Tedersoo et al., 2014; Kyaschenko
et al., 2017). Yet, many closely related species of ascomycete
saprotrophs and pathogens differ by no or only a few bases in
full-length ITS (e.g., O’Donnell et al., 2015). Based on 18S rRNA gene
sequences, ascomycete and basidiomycete species belonging to different
orders may display identical or near-identical sequences, which render
SSU unsuited for fungal DNA metabarcoding. However, unicellular fungal
lineages and Glomeromycota display greater variation, although
species-level distinction in the latter group is not straightforward as
most virtual taxa and OTUs are >50 million years old (Bruns
& Taylor, 2016; Perez-Lamarque et al., 2020), which roughly corresponds
to the age of genera in most other fungal groups. To improve taxonomic
resolution and reproducibility of identified taxa, amplicon sequence
variant (ASV) approaches have been developed (Box 2).
Taxonomic assignments
A representative sequence is chosen from each OTU for taxonomic
annotation. Most programs select one of the longest sequences, the
consensus sequence or the most common sequence type for comparison to
the reference corpus. In most cases, the latter is biologically the most
meaningful. The longest sequence may contain artifactual insertions,
untrimmed fragments of flanking genes or represent an unrecognized
chimera. Consensus sequences are prone to lending voice to rare and
perhaps compromised sequence data.
There are different approaches for taxonomic assignments, all of which
require a well-curated reference database. The most common approach is
custom BLASTn searches (Camacho et al., 2009), where all representative
sequences are compared pairwise to sequences in the reference database.
Users can specify blast parameters from slow and stringent to rapid and
discontinuous. We recommend using word size <10 to be able to
obtain long query-to-template alignments and hence the most precise
estimates of e-value and sequence similarity. Alternative to BLASTn,
programs performing k-mer search such as Naïve Bayesian classifier (Wang
et al., 2007; Porras-Alfaro et al., 2014) and SINTAX (Edgar, 2016) are
up to 100-fold faster, but the resulting similarity estimates are not as
straightforward to interpret. These algorithms work well in situations
where reference data are abundant and accurately identified to species
level, which is a somewhat atypical situation seen to the fungal kingdom
at large (Lücking et al., 2021). Phylogenetic placement algorithms such
as EPA-ng (Barbera et al., 2018) map rRNA gene-based OTUs to
pre-established phylogenies, but these methods are not suited for the
ITS region due to its hypervariability in both length and composition.
Some pipelines (e.g., amptk) return taxonomic assessments based on
results from multiple algorithms. While often slightly more
conservative, this approach gives greater confidence in the assigned
taxonomy when there is clear congruence across different algorithms.
If the samples include many undescribed/unbarcoded species, we recommend
relying on blast search against the UNITE database, focusing on the best
hit but also accounting for 4-9 next best hits. Based on multiple global
datasets, we have developed recommended taxon-specific e-value and
sequence similarity thresholds for 18S-V9, ITS2 and full-ITS reads at
the level of genera to phyla (Tedersoo et al., 2014; 2021b; updated in
Table S1). Manual BLAST-based double-checking of taxonomic affiliations
of at least the, say, 50 largest OTUs/ASVs are a part of those
recommendations.
With respect to reference databases, UNITE is the largest by containing
curated data obtained from International Nucleotide Sequence Databases
consortium (INSDc) as well as data submitted directly to UNITE (Nilsson
et al., 2018). Furthermore, UNITE provides species hypotheses (SHs) for
ITS-based OTUs of fungi and other eukaryotes to enable unambiguous
DOI-based cross-communication of taxa among studies and across time
(Kõljalg et al., 2016, 2020). Another curated dataset is the so-called
Warcup training set, which covers a smaller set of well-identified fungi
of mostly plant-associated Ascomycota and Basidiomycota(Deshpande et al., 2016). We recommend identification of fungal and
other eukaryote ITS sequences based on the UNITE reference data set,
because it is the largest curated database and it includes multiple
non-fungal reads to facilitate separation of fungi from other eukaryotes
(Anslan et al., 2018). We recommend the SILVA database for 18S and 28S
rRNA gene reads (Quast et al., 2013). For Glomeromycota 18S and
28S rRNA gene reads, MaarjAM (Öpik et al., 2010) and the AM-LSU
pipeline (Delavaux et al., 2021) can be used, respectively.
Functional assignments
Functional assignments can be supplied directly to best-hitting
reference sequences and species, genera and orders based on their
identification. The FUNGuild database is largely focused on the genus
level (occasionally on the species level) and provides functional
assignment of lifestyle and life mode based on probabilistic estimates
(Nguyen et al., 2016). The FungalTraits database is focused on genus and
order-level functional estimates covering additional traits such as
fruiting body and hymenium type and capacities of performing certain
biotic functions (Põlme et al., 2020). FungalTraits also allows
complementary trait estimates (geographic distribution, isolation source
and mycorrhizal type) based on sequence accessions and SHs. FunFun
includes a large number of genomic and enzymatic functional traits, but
its taxonomic coverage is limited (Zanne et al., 2020).
From a functional perspective, we argue that it makes the most sense to
analyse diversity at the level of all fungi and major functional guilds
such as ectomycorrhizal, arbuscular mycorrhizal or putatively plant
pathogenic fungi. Phyla, classes and orders are of limited value,
because functionality of fungi is mainly conserved at the level of genus
and family (Zanne et al., 2020). Yet these ranks, particularly orders
that are generally well-delimited, are informative for understanding the
taxonomic structure. One notable challenge with functional assignments
becomes apparent when individual taxa are matched to multiple ecological
lifestyles (e.g., both saprotroph and endophyte). Lumping these taxa
into a “multi-guild” category for ecological analyses is not an
effective solution; instead it is recommended that users carefully
analyze the associated primary literature for those taxa and try to
determine which functional type is most likely represented based on the
study system or research question being addressed.
Curation of the OTU matrix
The LULU software can be used for assessing co-occurrences of closely
related taxa in the OTU-by-sample matrix (Froslev et al., 2017). This
program allows the removal of OTUs that potentially represent minority
haplotypes of common OTUs and remaining PCR and sequencing errors.
The OTU-by-sample matrix alone or tagged with sample metadata and
taxonomic and functional annotation requires some manual curation in a
spreadsheet program. Although many functions can be performed by Python
and Bash scripting, the process of checking taxonomic annotations based
on multiple best hits needs to be performed by manual browsing, because
the first hit may be incompletely annotated. Often, representative
sequences require BLASTn-based re-analysis against INSDc for taxonomic
determination or chimera control (Nilsson et al., 2012) or more complete
functional annotation (Fernandez et al., 2017). Manual BLASTn
examination of the 25-50 largest (most frequent or abundant) OTUs is a
good way to both identify compromised sequence data and to refine the
taxonomic annotation of OTUs. Implementing taxon-specific thresholds for
genera to phyla is easier done in a manual mode.
Checking the distribution of sequences in control samples and spill-over
of positive control in experimental samples is mostly manual work, but
relevant functions are implemented in the metabaR R package and its
associated tool vignette (Zinger et al., 2021). These steps are required
to estimate the rate of index-switching and require appropriate
measures. OTUs found in negative control samples should be assessed
carefully, because these may be derived from molecular reagents,
laboratory space or neighbouring samples (Eisenhofer et al., 2019; Loit
et al., 2019). There are several programs for removing/subtracting
contamination-affected OTUs from data cells (e.g., McKnight et al., 2019
and Zinger et al., 2021). However, among-sample cross-contamination not
affecting control samples (or if there are no controls) may be very
difficult to find. Ranking OTUs by abundance and inspecting Spearman
correlation matrices may be useful to detect cross-contamination. To
ensure recognition of contaminants, sample-specific spike-in molecules
inserted to extractable samples can be used and traced to reads
(Lagerborg et al., 2021).
Rarefaction is a commonly used option to standardise an OTU matrix to
equal sequencing depth based on random subsampling of reads. Samples are
typically rarefied to the lowest sequencing depth (after removing failed
or notably low abundance samples), but there is no consensus whether the
sequencing depth should reflect all reads, fungal reads or reads of each
functional group taken separately. It should be borne in mind that
sequencing depth per sample is not necessarily related to DNA content of
the original sample or biological abundance of certain taxa. The main
issue with rarefaction is the substantial loss of data that often
corresponds to 90% of the sequencing depth that may carry important
information about diversity (McMurdie & Holmes, 2014). Figure 3
illustrates that rarefaction provides relatively lower statistical power
in the multivariate analyses as well. Another possibility is rarefaction
to median sequencing depth (de Carcer et al., 2011) and accounting for
sparse samples in statistical models, or through modelling only (Weiss
et al., 2017). An alternative to rarefying is scaling with ranked
subsampling, which retains around 20% more OTUs (Beule & Karlovsky,
2020). Some technical methodological comparisons and non-covariate
models may still require rarefying.