Molecular analyses
DNA extraction
Prior to DNA extraction, it is important to homogenize the material by
using a mortar and pestle, micropestles or bead beating in
microcentrifuge tubes. In this process, keeping samples at a relatively
cool temperature by using ice cubes or liquid nitrogen is also
important. The required amount of material should be weighed into the DNA
extraction tube, and the rest can be stored as a backup or for, e.g.,
stable isotope or chemical analyses. It is usually undesirable to
reach the full capacity of the DNA extraction kit, because several types
of samples (e.g., peat soils, dead wood, plant-debris rich sediments and
fleshy plant tissue) may absorb the liquid or co-extract inhibitors. For
well-homogenized soil samples, there are only minor differences in
richness when using DNA extracts from 0.25 g, 1 g or 10 g material (Song
et al., 2015), but increasing the volume (replicate extractions or more
material using ‘maxi’ kits) provides more reproducible estimates (Dickie
et al., 2018). It is crucial to perform weighing and DNA extraction
in a dedicated laminar flow hood in a room separated from the PCR lab to
avoid cross-contamination and airborne contamination by amplicons. Such
potential contaminants can also be detected and removed in downstream
analyses through analysis of DNA extraction blank controls.
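As a minimal sketch of how extraction blanks could be used downstream, the example below assumes an OTU table stored as per-sample read counts. The rule applied here (subtracting, per OTU, the maximum read count seen in any blank) is only one of several possible conventions and is an assumption rather than a prescribed method; the names `subtract_blank_reads`, `otu_table` and `blank_ids` and the toy data are hypothetical.

```python
from typing import Dict, Iterable

OtuTable = Dict[str, Dict[str, int]]  # sample id -> {OTU id -> read count}


def subtract_blank_reads(otu_table: OtuTable, blank_ids: Iterable[str]) -> OtuTable:
    """Subtract, per OTU, the maximum read count observed in any blank control."""
    blanks = set(blank_ids)

    # Highest read count of each OTU across all blank controls
    max_in_blanks: Dict[str, int] = {}
    for blank in blanks:
        for otu, reads in otu_table.get(blank, {}).items():
            max_in_blanks[otu] = max(max_in_blanks.get(otu, 0), reads)

    cleaned: OtuTable = {}
    for sample, counts in otu_table.items():
        if sample in blanks:
            continue  # blanks themselves are dropped from the cleaned table
        cleaned[sample] = {}
        for otu, reads in counts.items():
            remaining = reads - max_in_blanks.get(otu, 0)
            if remaining > 0:  # OTUs not exceeding the blank level are removed
                cleaned[sample][otu] = remaining
    return cleaned


# Toy data, invented for illustration only
table = {
    "soil_1": {"OTU1": 500, "OTU2": 12, "OTU3": 3},
    "soil_2": {"OTU1": 80, "OTU3": 40},
    "blank_1": {"OTU2": 15, "OTU3": 2},
}
cleaned = subtract_blank_reads(table, ["blank_1"])
print(cleaned)
```

Whether subtraction, complete removal of OTUs found in blanks, or a statistical approach is most appropriate depends on the study and should be decided case by case.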
For DNA extraction, we recommend following the protocols elaborated for
relevant substrates, using either manual methods or commercial kits. The CTAB
and phenol-chloroform protocols (multiple variants exist) are the most
broadly used manual methods for obtaining large quantities of long DNA
molecules. While the quantity of DNA from the aforementioned protocols
is usually relatively large, it is often less pure than kit-based extracts
and so may require further dilution ahead of PCR to minimize the effect
of inhibitors present in the sample (see below). Because of functional
limitations of DNA extraction robots, DNA purity and yield tend to be
lower with robot-based kits than with analogous non-robot kits.
As a rule of thumb, commercial non-robot and robot-based kits are
roughly 2 and 5 times more time-efficient, respectively, but 2-10 times
more costly than manual protocols.
Depending on the sample type and extraction method, the DNA may contain
impurities that hamper PCR amplification. These can be overcome by
pretreatment of samples during DNA extraction (e.g., Al³⁺ or Ca²⁺
flocculation of humic substances), purification using specific kits (e.g.,
polyvinylpolypyrrolidone spin columns against humic and fulvic acids in
soil; universal Zymo Research OneStep PCR Inhibitor Removal Kit or
Macherey-Nagel NucleoSpin Inhibitor Removal Kit against polyphenolics,
humic and fulvic acids, tannins, and melanin) or equipment (e.g., SCODA
electrophoresis), or precipitation with ethanol. Importantly, under most
conditions, dilution of the DNA extracts may be sufficient to eliminate
PCR inhibition (Wang et al., 2017). DNA concentrations can be increased
by precipitation with ethanol, salts (e.g., sodium acetate) and carriers
(e.g., Pellet Paint Co-Precipitant, glycogen, or linear polyacrylamide).
If the DNA extract contains a large proportion of short fragments (e.g.,
degraded DNA due to poorly preserved samples or extracellular DNA),
which hamper amplification and may promote chimera formation, these can
be removed by elution using specific kits such as AMPure (Beckman
Coulter Inc., www.beckman.com) and ProNex (Promega Corp.,
www.promega.com). Extracellular
DNA may account for 80% of total DNA, but it has little effect on
estimates of diversity, as it comes from the dead cells of indigenous
biota (Lennon et al., 2018). However, it is crucial to remove
extracellular DNA for time series and co-occurrence analyses on a
microscale (Lennon et al., 2018), which can be performed by sample
treatment with ethidium or propidium monoazide (Wagner et al., 2008).
Substrates destined for mesh-bag experiments (plant litter, wood) can be
freed of unwanted microbial DNA with gamma-irradiation (Brabcova
et al., 2016) or dry heating at >120 °C before exposure.
Control samples
Control samples, i.e. negative controls for sampling, DNA extraction and
PCR as well as positive controls (including mock communities), improve
scientific reproducibility by offering means by which to estimate the
accuracy of the analyses (reviewed in Zinger et al., 2019a). Both negative
and positive controls inform about external and cross-contamination
as well as potential index-switching (Carlsen et al., 2012; Esling et
al., 2015). Amplification and sequencing of mock communities
provide additional insights into the qualitative (i.e., estimation of
PCR/sequencing error rates) and quantitative capacity (i.e., biased
amplification) to retrieve the original diversity. Positive controls and
mock communities may consist of artificially synthesised molecules or DNA
extracts of actual species known not to occur in the experimental system
(Ihrmark et al., 2012; Song et al., 2015). A sophisticated mock
community should comprise >10 species with variable G+C
content, amplicon length and quantity based on actual marker copy
numbers. Additionally, due to index switching issues, consideration of
the specific species composition of the mock community is desirable.
Specifically, if the mock community contains the same taxa present in
the samples being analyzed, it becomes impossible to determine whether
any switched reads in the samples come from the mock community or not.
One solution to this problem is to use a non-biological mock community,
containing multiple “species” that are synthetically constructed to
have properties equivalent to biological species but that are never
present in nature (Palmer et al., 2018). Mock community analyses
commonly fail to recover all species and usually reveal more OTUs than
the input because of variable DNA quality, PCR bias, trace contamination
and index-switching (Ihrmark et al., 2012; Bakker, 2018). Therefore,
failing to recover the initial mock community does not necessarily
indicate that the analyses have failed, but it sheds light on potential
biases and serves as a reference to correct the data through
bioinformatics processing.
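The comparison between the input and the recovered mock community can be summarised with a few simple numbers. The following is a minimal sketch, assuming that the expected composition is known as a set of taxon names and that the observed data are read counts per taxon; the function name `mock_recovery` and the toy taxa are hypothetical.

```python
from typing import Dict, List, Set, Union


def mock_recovery(expected: Set[str],
                  observed: Dict[str, int]) -> Dict[str, Union[float, List[str]]]:
    """Summarise how well a sequenced mock community matches its known input."""
    if not expected:
        raise ValueError("the expected mock community must not be empty")

    detected = {taxon for taxon, reads in observed.items() if reads > 0}
    return {
        # Share of input taxa that were detected at all
        "recovered_fraction": len(expected & detected) / len(expected),
        # Input taxa that were not detected (e.g., PCR bias, poor DNA quality)
        "missing": sorted(expected - detected),
        # Detected taxa absent from the input (e.g., contamination,
        # index-switching, spurious OTUs)
        "unexpected": sorted(detected - expected),
    }


# Toy example, invented for illustration only
expected_taxa = {"sp_A", "sp_B", "sp_C", "sp_D"}
observed_reads = {"sp_A": 1200, "sp_B": 40, "sp_C": 0, "sp_X": 7}
summary = mock_recovery(expected_taxa, observed_reads)
print(summary)
```

The `unexpected` list gives a rough impression of the level of contamination and index-switching in the run, whereas `missing` points to taxa affected by amplification or extraction bias.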
Genetic markers
Obtaining high-quality amplicons is one of the most important steps in
metabarcoding analyses. This can be ensured by selecting a suitable
genetic marker, polymerase, relevant primers and appropriate
thermocycling conditions. Each PCR run requires a negative control to
rapidly detect contamination.
The ITS region of the rRNA cistron is the most broadly used marker for
fungi in both DNA barcoding and metabarcoding analyses due to its
high copy number, optimal species-level resolution in most groups
and the possibility to design both fungal-specific and universal primers
(Schoch et al., 2012; Nilsson et al., 2018). The ITS region is unsuited
to target certain fungi such as Microsporidia (intracellular animal
parasites) that may lack this fragment and certain Tulasnellaceae
(orchid root symbionts) that have mutations in primer sites (Tedersoo et
al., 2015; Rammitsu et al., 2021). Furthermore, ITS sequences lack
variability in some species in certain species-rich genera comprising
pathogens and saprotrophs such as Trichoderma and Fusarium, and their
analysis requires using additional taxonomic
markers, typically protein-coding genes (O’Donnell et al., 2015; Cai &
Druzhinina, 2021). The arbuscular mycorrhizal Glomeromycota have
multinucleate hyphae with highly variable ITS copies, which has rendered
the rRNA 28S and 18S gene fragments of broad use as well (Kolarikova et
al., 2021). Because of sequencing read length limits imposed by
second-generation HTS platforms, researchers have mostly focused on
either the ITS1 or ITS2 subregion, which taken separately have lower
taxonomic resolution and do not offer as suitable primer sites as the
full region (Tedersoo et al., 2015; Tedersoo et al., 2021a).
For metabarcoding, ecologists use mostly primers designed decades ago
for Sanger sequencing analyses (Figure 1). These original primers are
not optimal for the many fungal groups that have one or more
primer-template mismatches. They can be improved by adding degenerate
positions to minimise primer bias (Tedersoo & Lindahl, 2016) and
promote quantitative performance (Pinol et al., 2019). However, multiple
degeneracies may require altering the 1:1 ratio of primers and adding
extra PCR cycles, because not all primer variants match the templates.
The broadly used fungus-specific forward primer ITS1F is particularly
problematic because of several critical mismatches in certain groups of
molds and putative animal pathogens (Tedersoo & Lindahl, 2016).
Researchers should also consider the common presence of an intron at the
end of the 18S rRNA gene, which prevents sequencing of the taxa containing
this intron (Figure 1). It may be important to pair primers with similar
melting temperatures to secure optimal performance.
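Primer-template mismatches and the cost of degeneracy can be inspected with a few lines of code. The sketch below is a minimal illustration, assuming that the primer and its binding site are given in the same 5' to 3' orientation and have equal length; the primer and binding-site sequences are invented toy examples, not real primers, and the function names are hypothetical.

```python
from math import prod
from typing import List

# Standard IUPAC nucleotide ambiguity codes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degeneracy(primer: str) -> int:
    """Number of distinct oligonucleotide variants encoded by a degenerate primer."""
    return prod(len(IUPAC[base]) for base in primer.upper())


def mismatch_positions(primer: str, site: str) -> List[int]:
    """0-based positions (from the 5' end) where the site is not matched by the primer.

    Ambiguous bases in the binding site itself are counted as mismatches.
    """
    if len(primer) != len(site):
        raise ValueError("primer and binding site must have the same length")
    return [
        i for i, (p, s) in enumerate(zip(primer.upper(), site.upper()))
        if s not in IUPAC[p]
    ]


# Toy sequences, invented for illustration only
toy_primer = "GTGARTCATCGARTC"
site_matching = "GTGAATCATCGAGTC"
site_divergent = "GTGACTCATCGAGTT"

print(degeneracy(toy_primer))                      # number of primer variants
print(mismatch_positions(toy_primer, site_matching))
print(mismatch_positions(toy_primer, site_divergent))
```

Mismatches close to the 3' end (the highest positions in the returned list) are generally the most detrimental to amplification, so their position matters as much as their number.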
There are different amplicon library preparation strategies that require
consideration during the primer design step (Figure 2). The
metabarcoding primers may be equipped with both sample-specific indexes
and platform-specific adapters for sequencing. The alternative strategy
is to use shorter primers with only sample-specific indexes, which are
ca. 30-40% cheaper and easier to amplify, but require specific library
preparation depending on the sequencing platform. Approaches requiring
several PCR steps are also available (Figure 2; Bohmann et al., 2021),
but these are more prone to contamination and chimera formation.
Although vulnerable to contamination, the use of combinations of
Illumina flow cell indices in the second PCR step enables ultra-high
multiplexing of samples without index-switching bias (Holm et al.,
2020).
The sample-specific indexes are typically 6-14 bases in length and
differ from each other by at least 4 nucleotides (including indels) for
error correction (Buschmann & Bystrykh, 2013). Their GC content should
be in the range of 25-75% and homopolymers >2 nucleotides should be
avoided. An example set of >300 indexes is listed in
Taberlet et al. (2018). To reduce amplification biases, there should be
a 2-3-base linker between the index and PCR primer, which should not
align to any of the targeted sequences. The quality of Illumina
sequencing benefits from heterogeneity spacers added to the indexes
(Figure 2; Fadrosh et al., 2014). To secure more equal library
preparation, indexes should start with the same nucleotide. The same
indexes (but not linkers) can be used with multiple primers, but each
primer-index combination should be tested for hairpin structure
formation in silico using, e.g., EcoPCR (Ficetola et al., 2010).
Indexing both primers with unique tags is more expensive, but allows
users to greatly reduce index-switching artefacts (Schnell et al.,
2015), and is therefore strongly recommended.
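The index design rules listed above can be checked automatically before ordering primers. The following is a minimal sketch, assuming that plain Levenshtein distance (substitutions, insertions and deletions) is an acceptable measure of the difference between indexes; other distance measures may be preferable for error correction. The function names and the toy indexes are hypothetical, and hairpin formation is not tested here.

```python
from itertools import combinations
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]


def gc_fraction(seq: str) -> float:
    return sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymer(seq: str) -> int:
    longest = run = 0
    previous = ""
    for base in seq:
        run = run + 1 if base == previous else 1
        longest = max(longest, run)
        previous = base
    return longest


def check_index_set(indexes: List[str], min_len: int = 6, max_len: int = 14,
                    min_dist: int = 4, max_homopolymer: int = 2) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for idx in indexes:
        if not min_len <= len(idx) <= max_len:
            problems.append(f"{idx}: length outside {min_len}-{max_len}")
        if not 0.25 <= gc_fraction(idx) <= 0.75:
            problems.append(f"{idx}: GC content outside 25-75%")
        if longest_homopolymer(idx) > max_homopolymer:
            problems.append(f"{idx}: homopolymer longer than {max_homopolymer}")
    if len({idx[0] for idx in indexes}) > 1:
        problems.append("indexes do not all start with the same nucleotide")
    for a, b in combinations(indexes, 2):
        if levenshtein(a, b) < min_dist:
            problems.append(f"{a}/{b}: differ by fewer than {min_dist} edits")
    return problems


# Toy indexes, invented for illustration only
print(check_index_set(["ACACAC", "AGTGTG"]))
print(check_index_set(["ACACAC", "ACACAG"]))
```

The default thresholds mirror the ranges given in the text and can be tightened by the user.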
Polymerases
With respect to DNA polymerases, it is important to select one with
proofreading capacity in spite of the much greater cost. Proofreading
polymerases have much-reduced error rates and therefore result in fewer
spurious OTUs (Oliver et al., 2015; Bakker, 2018). The 3’ to 5’
exonuclease activity of proofreading polymerases performs primer editing
in the last 4-6 nucleotide positions, reducing primer bias (Gohl et al.,
2021). However, this activity varies with the polymerase, the mismatching
nucleotide and probably the concentration of inhibitors (Gohl et al., 2021),
and the effect on multiple near-terminal mismatches remains unexplored.
Hence, proofreading polymerases may also strongly reduce the specificity
of taxon-specific primers. Furthermore, the exonuclease activity of
proofreading polymerases creates multiple short fragments, especially
at low dNTP concentration, and prolongs elongation times, which may
result in more chimeras already at early stages of the PCR process (Ahn
et al., 2012). For longer amplicons, it is crucial to select
high-fidelity polymerases to secure amplification completion and hence
reduce production of chimeric artefacts (Heeger et al., 2018). Thus, a
wise selection of primers and polymerases allows researchers to obtain
the same amount of high-quality data with lower sequencing depth.
Thermal cycling conditions
Regarding PCR cycling conditions, reducing annealing temperature may
promote amplification of targeted taxa that have one or more
primer-template mismatches, but it may also enhance non-specific
priming, resulting in amplification of random genomic fragments or
untargeted taxa. The number of PCR cycles should be kept below 30,
optimally resulting in a weak band on a gel (Lindahl et al., 2013;
D’Amore et al., 2016). Rather than losing samples with no visible
amplicons, it is advised to add a few extra cycles to problematic
samples, but users should keep in mind that these low-input or
inhibitor-rich samples have elevated risk of contamination or biased
diversity patterns (Eisenhofer et al., 2019). Adding BSA may be useful
for improving amplification success, but this process may distort the
retrieved community (Zaiko et al., 2021).
Biological samples may differ by several orders of magnitude in their
DNA content, quality, and abundance of inhibitors. For PCR, the DNA
content is rarely equalised, because typically 80-99% of eDNA is
non-fungal, and the fungal fraction may vary significantly across
samples (Tedersoo et al., 2015; Bahram et al., 2018). There is no
consensus on whether or how the DNA quantity should be standardised,
although diluted samples may yield a higher proportion of contaminants
(Lindahl et al., 2013) as well as relatively lower diversity and greater
variability (Castle et al., 2018, but see Song et al., 2015 and Wang et
al., 2017). Therefore, at least two PCR replicates are needed to account
for the stochasticity. Such technical replicates can be pooled for
further analysis steps (Lindahl et al., 2013; Alberdi et al., 2018), but
this pooling step will prevent evaluation of the PCR replication and
exclusion of dysfunctional PCRs (Taberlet et al., 2018).
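If PCR replicates are sequenced separately, their agreement can be quantified before any pooling of the data. The sketch below is a minimal illustration, assuming that Bray-Curtis dissimilarity on relative abundances is a suitable measure of replicate divergence; this choice, the function name `bray_curtis` and the toy read counts are assumptions made for illustration, and no universal cut-off for a "dysfunctional" PCR is implied.

```python
from typing import Dict


def bray_curtis(rep_a: Dict[str, int], rep_b: Dict[str, int]) -> float:
    """Bray-Curtis dissimilarity between two replicates, based on relative abundances.

    Returns 0.0 for identical relative composition and 1.0 when no OTU is shared.
    """
    total_a, total_b = sum(rep_a.values()), sum(rep_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("each replicate needs at least one read")

    shared = 0.0
    for otu in set(rep_a) | set(rep_b):
        shared += min(rep_a.get(otu, 0) / total_a, rep_b.get(otu, 0) / total_b)
    return 1.0 - shared


# Toy replicates, invented for illustration only
replicate_1 = {"OTU1": 50, "OTU2": 50}
replicate_2 = {"OTU1": 50, "OTU3": 50}
print(bray_curtis(replicate_1, replicate_2))
```

Replicate pairs with unusually high dissimilarity relative to the rest of the data set are candidates for closer inspection or exclusion.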
Alternatives to traditional eDNA amplicon-based methods
To focus on the active community, RNA instead of DNA can be used as a
target for sequencing (Singer et al., 2017, but see Blazewicz et al.,
2013 for limitations). One option is to amplify reverse transcribed
cDNA, which can be performed for ITS sequences in spite of the short
life of precursor RNA (Rajala et al., 2011). Interestingly, cDNA-based
HTS reveals multiple taxa not recovered using DNA and vice versa (Rajala
et al., 2011). Another option is direct RNA sequencing, which is
currently provided only by ONT (Oxford Nanopore Technologies; Garalde et
al., 2018). Both methods produce more errors than state-of-the-art
DNA-based methods. As both PacBio and ONT sequencing make it possible to
record modified nucleotides such as various methylations, it may be
possible to record various artificial nucleotide analogues (e.g.,
3-bromo-deoxyuridine) incorporated into DNA in real time (Hanson et al.,
2008; Georgieva et al.,
2020). Stable isotope probing (SIP) is widely used for bacteria because of
their rapid metabolism of ¹³C-enriched substrates (Berry & Loy, 2018),
but it has been little used in mycology, likely due to the high costs
of enriched C (but see Hannula et al., 2017; Lopez-Mondejar et al.,
2020). Nevertheless, RNA-based SIP applications may offer more promise
for fungi than for bacteria (Singer et al., 2017; Ghori et al., 2019).
Metagenomics and metatranscriptomics can alternatively be used for
large-scale identification of organisms. These methods are free from PCR
biases but may be affected by library preparation biases and add an
order of magnitude to the costs (Quince et al., 2017; Singer et al.,
2017). While these methods work reasonably well on bacteria and viruses
with small and densely packed genomes and for which a rich set of
reference genomes is available, analyses of fungi and other eukaryotes
are heavily biased because of highly different genome sizes, number of
rRNA gene copies and the striking lack of reference genomes for many
important groups (Geisen et al., 2015; Tedersoo et al., 2015). This may
change very soon with the ongoing Earth BioGenome Project
(www.earthbiogenome.org) and the use of taxonomically more informative long
reads. It may also be possible to use targeted capture for rRNA genes or
other taxonomically and functionally informative genes to sequence these using
long-read protocols (Witek et al., 2016), but the analytical costs are
approximately five-fold the costs of regular metabarcoding.
As an alternative to taxon-specific primers, it is possible to use
blocking peptide nucleic acid (PNA) or locked nucleic acid
(LNA) oligonucleotides in conjunction with universal PCR primers
(Vestheim et al., 2011). PNAs are widely used in metabarcoding analyses
of plant-associated bacteria to block amplification and disable
subsequent sequencing of plastid and mitochondrial DNA (Lundberg et al.,
2013). Probably partly because of primer sites at the end of the 18S rRNA
gene that allow discrimination against plant amplicons, blocking
elements have found limited use in metabarcoding of fungi (but see
Ikenaga et al., 2016), although plant-specific motifs exist in all of
the 18S, 5.8S, and 28S rRNA genes. Banos et al. (2018) developed
protist-targeting PNAs for fungal communities in aquatic environments.
The use of blocking elements requires optimisation of concentration and
annealing temperature for each primer pair and polymerase used (Vestheim
et al., 2011) and furthermore necessitates double-checking any shifts in
the perceived diversity of fungi or other target organisms.
DNA library preparation
Among-sample variability of amplicon quantity is high at a low number of
PCR cycles. Therefore, the amount of amplicons should be standardised
for improved comparability of sequencing depths. This can be achieved by
DNA capture on a solid phase with limited binding capacity (SequalPrep,
Thermo Fisher Scientific), DNA content measurement and normalisation,
or simple visual estimates of band strength on an agarose gel.
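When normalisation is based on measured DNA content, the volumes needed for equimolar pooling follow from simple arithmetic. The sketch below is a minimal illustration, assuming double-stranded amplicons with an approximate average mass of 660 g/mol per base pair, concentrations in ng/µl and a user-defined target amount per sample; the function names and toy values are hypothetical.

```python
from typing import Dict, Tuple

AVG_BP_MASS = 660.0  # g/mol per base pair; approximate average for dsDNA (assumption)


def to_nanomolar(conc_ng_per_ul: float, length_bp: int) -> float:
    """Convert a dsDNA mass concentration (ng/µl) to molarity (nM = fmol/µl)."""
    if length_bp <= 0:
        raise ValueError("amplicon length must be positive")
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * length_bp)


def pooling_volumes(samples: Dict[str, Tuple[float, int]],
                    target_fmol: float) -> Dict[str, float]:
    """Volume (µl) of each sample that contains `target_fmol` of amplicon.

    `samples` maps a sample name to (concentration in ng/µl, mean amplicon length in bp).
    """
    volumes = {}
    for name, (conc, length) in samples.items():
        molarity = to_nanomolar(conc, length)
        if molarity <= 0:
            raise ValueError(f"{name}: concentration must be positive")
        volumes[name] = target_fmol / molarity  # fmol / (fmol/µl) = µl
    return volumes


# Toy measurements, invented for illustration only
measured = {"s1": (6.6, 500), "s2": (13.2, 500)}
volumes = pooling_volumes(measured, target_fmol=100.0)
print(volumes)
```

Samples whose required volume exceeds the available amount need either a lower target or re-amplification.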
The equimolarly pooled samples are subjected to library preparation
using HTS platform-specific kits. Among the multiple kits available for
Illumina, those free from amplification steps and biases of G+C content and
fragment length are recommended (Bowers et al., 2015; Sato et al.,
2019). Amplicons produced by different primers, even when of similar
length, should not be mixed into the same library because of great
differences in yield (Tedersoo et al., 2015). In-house library
preparation may be up to 5-fold cheaper compared to commercial services.
Sequencing platforms
For metabarcoding, both the second-generation and third-generation
platforms can be considered (reviewed in Tedersoo et al., 2021a).
Currently, the second-generation platforms allow sequencing of markers up
to ca. 550 base pairs in length, but their throughput exceeds that of
third-generation platforms by 1-2 orders of magnitude and their costs
per base are at least an order of magnitude lower. Given their relative
accuracy, Illumina (HiSeq and NovaSeq instruments in 2 x 250 paired-end
mode and MiSeq) and MGI-Tech (DNBSEQ-G400RS in 2 x 200 paired-end mode)
are best suited for analyses of short barcodes such as ITS1, ITS2 or one
or two variable regions combined within 18S and 28S rRNA genes.
The average raw read length of PacBio and ONT instruments exceeds 20 kb.
The libraries of PacBio consist of circularised amplicons, which are
sequenced multiple times (circular consensus sequencing; CCS), so that
error rates decrease from 10-15% to <0.1% at >10-fold consensus. This
allows sequencing of fragments of up to 3.5 kb that cover multiple markers
at high quality. Such long reads offer much
improved taxonomic resolution and allow rigorous phylogenetic analyses
based on reasonably long alignments of conserved regions (Tedersoo et
al., 2020b). Furthermore, random PCR and sequencing errors are typically
ironed out during the clustering process (Tedersoo et al., 2018), and
much of the relatively more degraded extracellular DNA is excluded.
Currently, ONT sequencing does not offer sufficient read quality for
metabarcoding. Although unique molecular identifiers (UMIs) can be used
in the generation of consensus sequences (Figure 2e; Karst et al.,
2021), obtaining at least 20-fold consensus will reduce throughput and
increase the overall cost tremendously. UMIs can also be used for
producing synthetic long reads using any of the short-read platforms,
which results in principally error-free long reads (Callahan et al.,
2021). However, a new commercial LoopSeq service provided by Loop
Genomics, Inc. (www.loopgenomics.com) is relatively costly (43-100
USD/sample). Taken together, the choice of HTS
strategy depends on expected data quality, number of samples included,
desired sequencing depth and amplicon length as well as available
financial resources (Tedersoo et al., 2021a).