Molecular analyses
DNA extraction
Prior to DNA extraction, it is important to homogenize the material by using a mortar and pestle, micropestles or bead beating in microcentrifuge tubes. During this process, it is also important to keep samples cool by using ice or liquid nitrogen. The required amount of material should be weighed into the DNA extraction tube, and the rest can be stored as a backup or for other purposes such as stable isotope or chemical analyses. It is usually undesirable to reach the full capacity of the DNA extraction kit, because several types of samples (e.g., peat soils, dead wood, plant-debris rich sediments and fleshy plant tissue) may absorb the liquid or co-extract inhibitors. For well-homogenized soil samples, there are only minor differences in richness when using DNA extracts from 0.25 g, 1 g or 10 g material (Song et al., 2015), but increasing the volume (replicate extractions or more material using ‘maxi’ kits) provides more reproducible estimates (Dickie et al., 2018). It is crucial to perform weighing and DNA extraction in a dedicated laminar flow hood in a room separated from the PCR lab to avoid cross-contamination and airborne contamination by amplicons. Such potential contaminants can also be detected and removed in downstream analyses through analysis of DNA extraction blank controls.
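The use of extraction blanks for downstream contaminant removal can be illustrated with a minimal sketch. The heuristic shown here (subtracting, for each OTU, the highest read count observed in any blank) is only one of several possible strategies and is an assumption of this example; the function name and data layout are hypothetical.

```python
from typing import Dict

# Read counts per OTU for each sample: {sample_name: {otu_id: reads}}
CountTable = Dict[str, Dict[str, int]]


def subtract_blank_maximum(samples: CountTable, blanks: CountTable) -> CountTable:
    """Subtract, per OTU, the highest read count seen in any blank control.

    This is a simple, conservative heuristic chosen for illustration,
    not a method prescribed by the text.
    """
    # Highest count of each OTU across all extraction blanks.
    blank_max: Dict[str, int] = {}
    for counts in blanks.values():
        for otu, reads in counts.items():
            blank_max[otu] = max(blank_max.get(otu, 0), reads)

    cleaned: CountTable = {}
    for sample, counts in samples.items():
        cleaned[sample] = {}
        for otu, reads in counts.items():
            remaining = reads - blank_max.get(otu, 0)
            if remaining > 0:  # drop OTUs fully explained by the blanks
                cleaned[sample][otu] = remaining
    return cleaned
```

OTUs that occur only in the blanks are thereby removed, whereas abundant OTUs with trace occurrence in a blank lose only a few reads.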
For DNA extraction, we recommend following the protocols developed for the relevant substrates, using either manual methods or commercial kits. The CTAB and phenol-chloroform protocols (multiple variants exist) are the most broadly used manual methods for obtaining large quantities of long DNA molecules. While the quantity of DNA obtained with these protocols is usually relatively large, it is often less pure than kit-based extracts and so may require further dilution ahead of PCR to minimize the effect of inhibitors present in the sample (see below). Because of functional limitations in DNA extraction robots, DNA purity and yield tend to be greater with analogous non-robot kits than with robot-based kits. As a rule of thumb, commercial non-robot and robot-based kits are roughly 2 and 5 times more time-efficient, respectively, but 2-10 times more costly than manual protocols.
Depending on the sample type and extraction method, the DNA may contain impurities that hamper PCR amplification. These can be overcome by pretreatment of samples during DNA extraction (e.g., Al3+ or Ca2+ flocculation of humic substances), purification using specific kits (e.g., polyvinylpolypyrrolidone spin columns against humic and fulvic acids in soil; universal Zymo Research OneStep PCR Inhibitor Removal Kit or Macherey-Nagel NucleoSpin Inhibitor Removal Kit against polyphenolics, humic and fulvic acids, tannins, and melanin) or equipment (e.g., SCODA electrophoresis), or precipitation with ethanol. Importantly, under most conditions, dilution of the DNA extracts may be sufficient to eliminate PCR inhibition (Wang et al., 2017). DNA concentrations can be increased by precipitation with ethanol, salts (e.g., sodium acetate) and carriers (e.g., Pellet Paint Co-Precipitant, glycogen, or linear polyacrylamide). If the DNA extract contains a large proportion of short fragments (e.g., degraded DNA due to poorly preserved samples or extracellular DNA), which hamper amplification and may promote chimera formation, these can be removed by elution using specific kits such as AMPure (Beckman Coulter Inc., www.beckman.com) and ProNex (Promega Corp., www.promega.com). Extracellular DNA may account for 80% of total DNA, but it has little effect on estimates of diversity, as it comes from the dead cells of indigenous biota (Lennon et al., 2018). However, it is crucial to remove extracellular DNA for time series and co-occurrence analyses on a microscale (Lennon et al., 2018), which can be performed by sample treatment with ethidium or propidium monoazide (Wagner et al., 2008). Substrates destined for mesh-bag experiments (plant litter, wood) can be freed of unnecessary microbial DNA by gamma-irradiation (Brabcova et al., 2016) or dry heating at >120 °C before exposure.
Control samples
Control samples, i.e., negative controls for sampling, DNA extraction and PCR as well as positive controls (including mock communities), improve scientific reproducibility by offering a means to estimate the accuracy of the analyses (reviewed in Zinger et al., 2019a). Negative and positive controls all inform about external and cross-contamination as well as potential index-switching (Carlsen et al., 2012; Esling et al., 2015). Amplification and sequencing of mock communities provide additional insights into the qualitative (i.e., estimation of PCR/sequencing error rates) and quantitative (i.e., biased amplification) capacity to retrieve the original diversity. Positive controls and mock communities may consist of artificially synthesised molecules or DNA extracts of actual species known not to occur in the experimental system (Ihrmark et al., 2012; Song et al., 2015). A sophisticated mock community should comprise >10 species with variable G+C content, amplicon length and quantity based on actual marker copy numbers. Additionally, because of index-switching, the species composition of the mock community requires careful consideration. Specifically, if the mock community contains the same taxa present in the samples being analyzed, it becomes impossible to determine whether any switched reads in the samples come from the mock community or not. One solution to this problem is to use a non-biological mock community, containing multiple “species” that are synthetically constructed to have properties equivalent to biological species but that are never present in nature (Palmer et al., 2018). Mock community analyses commonly fail to recover all species and usually reveal more OTUs than the input because of variable DNA quality, PCR bias, trace contamination and index-switching (Ihrmark et al., 2012; Bakker, 2018).
Therefore, failing to recover the initial mock community does not necessarily indicate that the analyses have failed, but it sheds light on potential biases and serves as a reference to correct the data through bioinformatics processing.
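How a sequenced mock community can serve as such a reference may be sketched as follows. The example compares the known input taxa with the taxa recovered after bioinformatics processing; the function name, the data layout and the summary statistics are illustrative assumptions rather than an established procedure.

```python
from typing import Dict, Set


def summarise_mock(expected: Set[str], observed: Dict[str, int]) -> Dict[str, object]:
    """Compare the known mock composition with the taxa actually recovered.

    expected: names of taxa added to the mock community.
    observed: read counts per taxon recovered from the mock sample.
    """
    detected = {taxon for taxon, reads in observed.items() if reads > 0}
    recovered = expected & detected    # input taxa that were found
    missing = expected - detected      # input taxa that were lost
    unexpected = detected - expected   # contaminants, switched indexes, artefacts
    total_reads = sum(observed.values())
    unexpected_reads = sum(observed[taxon] for taxon in unexpected)
    return {
        "recovered": sorted(recovered),
        "missing": sorted(missing),
        "unexpected": sorted(unexpected),
        "recovery_rate": len(recovered) / len(expected) if expected else 0.0,
        "unexpected_read_fraction": unexpected_reads / total_reads if total_reads else 0.0,
    }
```

The fraction of reads assigned to unexpected taxa gives a rough indication of the combined level of contamination and index-switching, which can guide the choice of abundance thresholds for the real samples.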
Genetic markers
Obtaining high-quality amplicons is one of the most important steps in metabarcoding analyses. This can be ensured by selecting a suitable genetic marker, polymerase, relevant primers and appropriate thermocycling conditions. Each PCR run requires a negative control to rapidly detect contamination.
The ITS region of the rRNA cistron is the most broadly used marker for fungi in both DNA barcoding and metabarcoding analyses due to its high copy number, optimal species-level resolution in most groups and the possibility to design both fungal-specific and universal primers (Schoch et al., 2012; Nilsson et al., 2018). The ITS region is unsuitable for targeting certain fungi such as Microsporidia (intracellular animal parasites) that may lack this fragment and certain Tulasnellaceae (orchid root symbionts) that have mutations in primer sites (Tedersoo et al., 2015; Rammitsu et al., 2021). Furthermore, ITS sequences lack variability in some species in certain species-rich genera comprising pathogens and saprotrophs such as Trichoderma and Fusarium, and their analysis requires using additional taxonomic markers, typically protein-coding genes (O’Donnell et al., 2015; Cai & Druzhinina, 2021). The arbuscular mycorrhizal Glomeromycota have multinucleate hyphae with highly variable ITS copies, which has rendered the rRNA 28S and 18S gene fragments of broad use as well (Kolarikova et al., 2021). Because of sequencing read length limits imposed by second-generation HTS platforms, researchers have mostly focused on either the ITS1 or ITS2 subregion, which taken separately have lower taxonomic resolution and do not offer as suitable primer sites as the full region (Tedersoo et al., 2015; Tedersoo et al., 2021a).
For metabarcoding, ecologists mostly use primers designed decades ago for Sanger sequencing analyses (Figure 1). These original primers are not optimal for the many fungal groups that have one or more primer-template mismatches. They can be improved by adding degenerate positions to minimise primer bias (Tedersoo & Lindahl, 2016) and promote quantitative performance (Pinol et al., 2019). However, multiple degeneracies may require altering the 1:1 ratio of primers and may require extra PCR cycles, because not all variants match the templates. The broadly used fungus-specific forward primer ITS1F is particularly problematic because of several critical mismatches in certain groups of molds and putative animal pathogens (Tedersoo & Lindahl, 2016). Researchers should also consider the common presence of an intron at the end of the 18S rRNA gene, which prevents sequencing of the taxa containing this intron (Figure 1). It may be important to pair primers with similar melting temperatures to secure optimal performance.
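The cost of multiple degeneracies can be made explicit by counting the variants encoded by a degenerate primer. The sketch below uses the standard IUPAC ambiguity codes; the primer strings in the usage note are made-up examples and not any published primer. Assuming all variants are synthesised in equal proportions, a primer with D-fold degeneracy contains each exact variant at only 1/D of the total primer concentration.

```python
from itertools import product
from typing import List

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degeneracy(primer: str) -> int:
    """Number of distinct sequences encoded by a degenerate primer."""
    fold = 1
    for base in primer.upper():
        fold *= len(IUPAC[base])
    return fold


def expand(primer: str) -> List[str]:
    """List every non-degenerate variant of a degenerate primer."""
    options = (IUPAC[base] for base in primer.upper())
    return ["".join(variant) for variant in product(*options)]
```

For instance, `degeneracy("ACGRTY")` returns 4 and `expand("ARY")` returns `["AAC", "AAT", "AGC", "AGT"]`, so each exact variant of a primer with two two-fold degenerate positions makes up a quarter of the primer pool.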
There are different amplicon library preparation strategies that require consideration during the primer design step (Figure 2). The metabarcoding primers may be equipped with both sample-specific indexes and platform-specific adapters for sequencing. The alternative strategy is to use shorter primers with only sample-specific indexes, which are ca. 30-40% cheaper and easier to amplify, but require specific library preparation depending on the sequencing platform. Approaches requiring several PCR steps are also available (Figure 2; Bohmann et al., 2021), but these are more prone to contamination and chimera formation. Although vulnerable to contamination, the use of combinations of Illumina flow cell indices in the second PCR step enables ultra-high multiplexing of samples without index-switching bias (Holm et al., 2020).
The sample-specific indexes are typically 6-14 bases in length and differ from each other by at least 4 nucleotides (including indels) for error correction (Buschmann & Bystrykh, 2013). Their GC content should be in the range of 25-75% and homopolymers >2 nucleotides should be avoided. An example set of >300 indexes is listed in Taberlet et al. (2018). To reduce amplification biases, there should be a 2-3-base linker between the index and PCR primer, which should not align to any of the targeted sequences. The quality of Illumina sequencing benefits from heterogeneity spacers added to the indexes (Figure 2; Fadrosh et al., 2014). To secure more equal library preparation, indexes should start with the same nucleotide. The same indexes (but not linkers) can be used with multiple primers, but each primer-index combination should be tested for hairpin structure formation in silico using, e.g., EcoPCR (Ficetola et al., 2010). Indexing both primers with unique tags is more expensive, but allows users to greatly reduce index-switching artefacts (Schnell et al., 2015), and is therefore strongly recommended.
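The design rules for indexes listed above (length, GC content, homopolymers and pairwise distance) can be checked with a short script. The following is a minimal sketch in which the function names are hypothetical and the plain Levenshtein distance is used as a simple stand-in for distances that count indels; dedicated barcode-design tools may apply stricter metrics.

```python
from itertools import combinations
from typing import List, Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of identical bases."""
    longest = run = 0
    previous = ""
    for base in seq.upper():
        run = run + 1 if base == previous else 1
        longest = max(longest, run)
        previous = base
    return longest


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (substitutions, insertions and deletions)."""
    previous = list(range(len(b) + 1))
    for i, base_a in enumerate(a, start=1):
        current = [i]
        for j, base_b in enumerate(b, start=1):
            cost = 0 if base_a == base_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def check_index(index: str) -> List[str]:
    """Return the design rules violated by a single index."""
    problems = []
    if not 6 <= len(index) <= 14:
        problems.append("length outside 6-14 bases")
    if not 0.25 <= gc_fraction(index) <= 0.75:
        problems.append("GC content outside 25-75%")
    if longest_homopolymer(index) > 2:
        problems.append("homopolymer longer than 2 nucleotides")
    return problems


def too_similar_pairs(indexes: List[str], min_distance: int = 4) -> List[Tuple[str, str, int]]:
    """Return index pairs separated by fewer than min_distance edits."""
    flagged = []
    for first, second in combinations(indexes, 2):
        distance = edit_distance(first, second)
        if distance < min_distance:
            flagged.append((first, second, distance))
    return flagged
```

Running `check_index` on every candidate and `too_similar_pairs` on the full set before ordering primers identifies indexes that violate these rules; checks for hairpin formation with the primer are not covered by this sketch.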
Polymerases
With respect to DNA polymerases, it is important to select one with proofreading capacity in spite of its much greater cost. Proofreading polymerases have much-reduced error rates and therefore result in fewer spurious OTUs (Oliver et al., 2015; Bakker, 2018). The 3’ to 5’ exonuclease activity of proofreading polymerases performs primer editing in the last 4-6 nucleotide positions, reducing primer bias (Gohl et al., 2021). However, this activity varies with the polymerase, the mismatching nucleotide and probably the concentration of inhibitors (Gohl et al., 2021), and the effect on multiple near-terminal mismatches remains unexplored. Hence, proofreading polymerases may also strongly reduce the specificity of taxon-specific primers. Furthermore, the exonuclease activity of proofreading polymerases creates multiple short fragments, especially at low dNTP concentration, and prolongs elongation times, which may result in more chimeras already at early stages of the PCR process (Ahn et al., 2012). For longer amplicons, it is crucial to select high-fidelity polymerases to secure amplification completion and hence reduce production of chimeric artefacts (Heeger et al., 2018). Thus, a wise selection of primers and polymerases allows researchers to obtain the same amount of high-quality data with lower sequencing depth.
Thermal cycling conditions
Regarding PCR cycling conditions, reducing annealing temperature may promote amplification of targeted taxa that have one or more primer-template mismatches, but it may also enhance non-specific priming, resulting in amplification of random genomic fragments or untargeted taxa. The number of PCR cycles should be kept below 30, optimally resulting in a weak band on a gel (Lindahl et al., 2013; D’Amore et al., 2016). Rather than losing samples with no visible amplicons, it is advised to add a few extra cycles to problematic samples, but users should keep in mind that these low-input or inhibitor-rich samples have elevated risk of contamination or biased diversity patterns (Eisenhofer et al., 2019). Adding BSA may be useful for improving amplification success, but this process may distort the retrieved community (Zaiko et al., 2021).
Biological samples may differ by several orders of magnitude in their DNA content, quality, and abundance of inhibitors. For PCR, the DNA content is rarely equalised, because typically 80-99% of eDNA is non-fungal, and the fungal fraction may vary significantly across samples (Tedersoo et al., 2015; Bahram et al., 2018). There is no consensus on whether or how the DNA quantity should be standardised, although diluted samples may yield a higher proportion of contaminants (Lindahl et al., 2013) as well as relatively lower diversity and greater variability (Castle et al., 2018, but see Song et al., 2015 and Wang et al., 2017). Therefore, at least two PCR replicates are needed to account for the stochasticity. Such technical replicates can be pooled for further analysis steps (Lindahl et al., 2013; Alberdi et al., 2018), but this pooling step will prevent evaluation of the PCR replication and exclusion of dysfunctional PCRs (Taberlet et al., 2018).
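When technical replicates are sequenced with separate indexes, their agreement can be evaluated before pooling in silico. The sketch below computes the Bray-Curtis dissimilarity between the relative abundances of two replicates and pools them only if they are sufficiently similar. The function names and the default threshold of 0.5 are arbitrary assumptions for illustration and would need to be tuned for each data set.

```python
from typing import Dict, Optional

Counts = Dict[str, int]  # read counts per OTU in one PCR replicate


def relative_abundance(counts: Counts) -> Dict[str, float]:
    """Convert read counts to proportions (empty dict if there are no reads)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {otu: reads / total for otu, reads in counts.items()}


def bray_curtis(rep_a: Counts, rep_b: Counts) -> float:
    """Bray-Curtis dissimilarity between two replicates, based on proportions."""
    rel_a = relative_abundance(rep_a)
    rel_b = relative_abundance(rep_b)
    if not rel_a or not rel_b:
        return 1.0  # an empty replicate shares nothing with its partner
    shared = sum(min(rel_a[otu], rel_b[otu]) for otu in rel_a.keys() & rel_b.keys())
    return 1.0 - shared


def pool_if_consistent(rep_a: Counts, rep_b: Counts,
                       max_dissimilarity: float = 0.5) -> Optional[Counts]:
    """Sum the reads of two replicates, or return None if they disagree."""
    if bray_curtis(rep_a, rep_b) > max_dissimilarity:
        return None  # flag for inspection rather than pooling
    pooled = dict(rep_a)
    for otu, reads in rep_b.items():
        pooled[otu] = pooled.get(otu, 0) + reads
    return pooled
```

Replicate pairs for which `pool_if_consistent` returns `None` can be inspected individually, for example to exclude a failed or contaminated reaction.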
Alternatives to traditional eDNA amplicon-based methods
To focus on the active community, RNA instead of DNA can be used as a target for sequencing (Singer et al., 2017, but see Blazewicz et al., 2013 for limitations). One option is to amplify reverse-transcribed cDNA, which can be performed for ITS sequences in spite of the short life of precursor RNA (Rajala et al., 2011). Interestingly, cDNA-based HTS reveals multiple taxa not recovered using DNA and vice versa (Rajala et al., 2011). Another option is direct RNA sequencing, which is currently provided only by ONT (Oxford Nanopore Technologies; Garalde et al., 2018). Both methods produce more errors than state-of-the-art DNA-based methods. As both PacBio and ONT sequencing make it possible to record modified nucleotides such as various methylations, it may be possible to record in real time various artificial nucleotide analogues (e.g., 5-bromo-2'-deoxyuridine) incorporated into DNA (Hanson et al., 2008; Georgieva et al., 2020). Stable isotope probing (SIP) is widely used for bacteria because of their rapid metabolism of 13C-enriched substrates (Berry & Loy, 2018), but it has been little used in mycology, likely due to the high costs of enriched C (but see Hannula et al., 2017; Lopez-Mondejar et al., 2020). Nevertheless, RNA-based SIP applications may offer more promise for fungi than for bacteria (Singer et al., 2017; Ghori et al., 2019).
Metagenomics and metatranscriptomics can alternatively be used for large-scale identification of organisms. These methods are free from PCR biases but may be affected by library preparation biases and add an order of magnitude to the costs (Quince et al., 2017; Singer et al., 2017). While these methods work reasonably well on bacteria and viruses, which have small and densely packed genomes and for which a rich set of reference genomes is available, analyses of fungi and other eukaryotes are heavily biased because of highly different genome sizes, number of rRNA gene copies and the striking lack of reference genomes for many important groups (Geisen et al., 2015; Tedersoo et al., 2015). This may change very soon with the ongoing Earth BioGenome Project (www.earthbiogenome.org) and the use of taxonomically more informative long reads. It may also be possible to use targeted capture of rRNA genes or other taxonomically or functionally informative genes and to sequence these using long-read protocols (Witek et al., 2016), but the analytical costs are approximately five-fold the costs of regular metabarcoding.
As an alternative to taxon-specific primers, it is possible to use blocking peptide nucleic acid (PNA) or locked nucleic acid (LNA) oligonucleotides in conjunction with universal PCR primers (Vestheim et al., 2011). PNAs are widely used in metabarcoding analyses of plant-associated bacteria to block amplification and disable subsequent sequencing of plastid and mitochondrial DNA (Lundberg et al., 2013). Probably partly because of primer sites at the end of the 18S rRNA gene that allow discrimination against plant amplicons, blocking elements have found limited use in metabarcoding of fungi (but see Ikenaga et al., 2016), although plant-specific motifs exist in all of the 18S, 5.8S, and 28S rRNA genes. Banos et al. (2018) developed protist-targeting PNAs for fungal communities in aquatic environments. The use of blocking elements requires optimisation of concentration and annealing temperature for each primer pair and polymerase used (Vestheim et al., 2011) and furthermore necessitates double-checking any shifts in the perceived diversity of fungi or other target organisms.
DNA library preparation
Among-sample variability of amplicon quantity is high at a low number of PCR cycles. Therefore, the amount of amplicons should be standardised for improved comparability of sequencing depths. This can be achieved by DNA capture on a solid phase with limited binding capacity (SequalPrep, Thermo Fisher Scientific), DNA content measurement and normalisation, or simple estimates of the band strength on agarose gel by eye.
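For the measurement-based normalisation option, the volume of each amplicon to be pooled can be calculated from its concentration and length. The sketch below uses the common approximation of 660 g/mol per base pair of double-stranded DNA and the identity 1 nM = 1 fmol/µL; the function names, the target amount and the use of a single average amplicon length per sample are assumptions of this example.

```python
from typing import Dict, Tuple

AVERAGE_BP_MASS = 660.0  # g/mol per base pair of dsDNA (standard approximation)


def molarity_nM(conc_ng_per_ul: float, length_bp: float) -> float:
    """Convert a dsDNA concentration in ng/µL to nM (equal to fmol/µL)."""
    return conc_ng_per_ul * 1e6 / (AVERAGE_BP_MASS * length_bp)


def pooling_volumes(samples: Dict[str, Tuple[float, float]],
                    target_fmol: float) -> Dict[str, float]:
    """Volume (µL) of each sample that contains target_fmol of amplicon.

    samples maps a sample name to (concentration in ng/µL, amplicon length in bp).
    """
    volumes = {}
    for name, (conc, length) in samples.items():
        if conc <= 0 or length <= 0:
            raise ValueError(f"sample {name}: concentration and length must be positive")
        volumes[name] = target_fmol / molarity_nM(conc, length)
    return volumes
```

For example, a 500-bp amplicon at 6.6 ng/µL corresponds to about 20 nM, so 5 µL would contribute roughly 100 fmol to the pool. Because ITS amplicons vary in length among taxa, the average length used here is itself only an estimate.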
The equimolarly pooled samples are subjected to library preparation using HTS platform-specific kits. Among the multiple kits available for Illumina, those free from amplification steps and biases of G+C content and fragment length are recommended (Bowers et al., 2015; Sato et al., 2019). Amplicons produced by different primers, even when of similar length, should not be mixed into the same library because of great differences in yield (Tedersoo et al., 2015). In-house library preparation may be up to 5-fold cheaper than commercial services.
Sequencing platforms
For metabarcoding, both the second-generation and third-generation platforms can be considered (reviewed in Tedersoo et al., 2021a). Currently, the second-generation platforms allow sequencing up to ca. 550 base pair markers, but their throughput exceeds that of third-generation platforms by 1-2 orders of magnitude and their costs per base are at least an order of magnitude lower. Given their relative accuracy, Illumina (HiSeq and NovaSeq instruments in 2 x 250 paired-end mode and MiSeq) and MGI-Tech (DNBSEQ-G400RS in 2 x 200 paired-end mode) are best suited for analyses of short barcodes such as ITS1, ITS2 or one or two variable regions combined within 18S and 28S rRNA genes.
The average raw read length of PacBio and ONT instruments exceeds 20 kb. The libraries of PacBio consist of circularised amplicons, which are sequenced multiple times (circular consensus sequencing; CCS), so that error rates decrease from 10-15% to <0.1% at >10-fold consensus. This allows high-quality sequencing of fragments of up to 3.5 kb that cover multiple markers. Such long reads offer much improved taxonomic resolution and allow rigorous phylogenetic analyses based on reasonably long alignments of conserved regions (Tedersoo et al., 2020b). Furthermore, random PCR and sequencing errors are typically ironed out during the clustering process (Tedersoo et al., 2018), and much of the relatively more degraded extracellular DNA is excluded.
Currently, ONT sequencing does not offer sufficient read quality for metabarcoding. Although unique molecular identifiers (UMIs) can be used in the generation of consensus sequences (Figure 2e; Karst et al., 2021), obtaining at least 20-fold consensus will reduce throughput and increase the overall cost tremendously. UMIs can also be used for producing synthetic long reads using any of the short-read platforms, which results in long reads that are in principle error-free (Callahan et al., 2021). However, the new commercial LoopSeq service provided by Loop Genomics, Inc. (www.loopgenomics.com) is relatively costly (43-100 USD/sample). Taken together, the choice of HTS strategy depends on expected data quality, number of samples included, desired sequencing depth and amplicon length as well as available financial resources (Tedersoo et al., 2021a).