Statistical data analyses
Alpha diversity metrics
Microbial diversity has been traditionally studied with OTU richness as
well as Shannon and Simpson diversity indices and the effective species
numbers (Hill numbers) derived from these (Chao et al., 2014; Alberdi &
Gilbert, 2019). Diversity indices down-weigh the effect of rare OTUs and
therefore only weakly correlate with sequencing depth. OTU richness has
a cumulating function with sequencing depth that is particularly
prominent in diverse, pooled samples. Unless rarefaction is performed,
it is important to include square-root or log-transformed sequencing
depth, whichever is more informative, as a covariate. Although commonly
used, we do not recommend OTU richness extrapolations, because these
rely on the abundance of the rarest OTUs, which are commonly artefactual
(Bunge et al., 2014; Balint et al., 2016).
Adding phylogenetic information to taxonomic composition eliminates the
uncertainty regarding OTU calculations (Washburne et al., 2018) and
could also reduce the effect of remaining PCR/sequencing errors in the
data (Taberlet et al 2018). However, phylogenetic composition of fungal
communities has rarely been assessed (Horn et al., 2014), because the
ITS region is not amenable to robust multiple alignments and
phylogenetic reconstruction much beyond the genus level. There are
alternative methods available such as grafting phylogenies (Fouquier et
al., 2016) and using taxonomic ranks (Tedersoo et al., 2018,
Chalmandrier et al., 2019), which may offer some additional insights.
Testing phylogenetic conservatism, overdispersion and turnover across
phylogenetic scales (Tucker et al., 2016) may also be informative in
analyses of plant-fungal interactions (Chalmandrier et al., 2019),
fungal community assembly processes (Roy et al., 2019) and
phylogeographic patterns (Turon et al., 2020). These statistics can be
calculated in phylocom (Webb et al., 2008), the R packages picante
(Kembel et al., 2010) and S.phylomaker (Qian & Jin, 2016) as well as
other open-access scripts (Stegen et al., 2015; Ning et al., 2019;
Chalmandrier et al., 2019).
Functional data, derived from genomics, metagenomics,
metatranscriptomics or functional annotation of OTUs complements
information about the taxonomic composition (Laliberte & Legendre,
2010; Egidi et al., 2019; Escalas et al., 2019) because of widespread
redundancy and mostly family-level phylogenetic autocorrelation in
fungal functional profiles (Zanne et al., 2020). In functional diversity
measures, it is important to consider within-function diversity rather
than richness or relative abundance in a certain function (Escalas et
al., 2019).
Statistical methods
HTS analyses produce semiquantitative abundance data for OTUs, and many
rare taxa remain beyond the detection level. Fungal species differ
greatly in the number of cells per unit biomass, the number of nuclei
per cell and the number of ITS copies per nucleus (Lofgren et al.,
2019). The use of quantitative methods assumes that all samples are
influenced by these biases in a similar manner.
Because the sequencing data are compositional and reflect relative
rather than absolute abundances (Gloor et al., 2017; Lin & Peddada,
2020b), it may be necessary to transform the entire OTU table
accordingly for testing factors affecting community shifts (this
excludes null model based approaches, e.g. as in Stegen et al., 2015,
which often rely on a sampling process of reads/individuals). Such
transformations can be performed by using additive or modified centered
log-ratio transformation (Quinn et al., 2019; Yoon et al., 2019; Ha et
al., 2020), although it should be kept in mind that these
transformations have limits when the data are sparse (Lovell et al.,
2020). Alternatively, programs for compositional data analyses (e.g.,
ANCOM-BC; Lin & Peddada, 2020a) and the Aitchison distance metric can
be used (Quinn et al., 2019). With appropriate transformations,
different methods reveal roughly comparable results (Figure 3).
In large datasets, our experience is that log-transformation of fungal
OTU richness yields better-explained statistical models. Classical
transformations of metadata such as logarithmic (concentrations),
square-root (counts) and log-ratio transformations (proportions) are
necessary to shift the distribution of residuals towards normal
distribution and reduce heteroscedasticity – principal assumptions of
most parametric tests (Legendre & Legendre, 2012). Log-ratio
transformation also increases the independence of measurements of
various ratios summing up 100%.
Phylogenetic relationships among associated plants, metazoans, or other
organisms can be accounted for by testing phylogenetic signals or
explicitly quantifying the phylogeny effect by using eigenvectors (e.g.,
adespatial package of R; Dray et al., 2018). Compared with using pure
phylogenetic distances, the use of eigenvectors has proved to extract
larger proportions of plant effects on fungal communities (Tedersoo et
al., 2013; Yang et al., 2019). Similar eigenvector approaches have also
been used to quantify and account for spatial and temporal and plant
phylogeny effects on fungal diversity (Zimmerman & Vitousek, 2012;
Tedersoo et al., 2020a).
For statistical analyses, most of the useful methods that have been
elaborated in plant and microbial ecology are applicable for fungi, too
(for review, see Buttigieg & Ramette, 2014; Balint et al., 2016;
Hugerth & Andersson, 2017). In particular, the R packages vegan
(Oksanen et al., 2019), phyloseq (McMurdie & Holmes, 2013) and microeco
(Liu et al., 2021) are useful for routine analyses besides commercial
statistical platforms.
When it comes to large amounts of metadata with potential
multicollinearity, it is feasible to pre-select the most important
variables by machine learning algorithms (Qu et al., 2019) such as
random forest implemented in the randomforest (Liaw and Wiener, 2002)
and VSURF (Genuer et al., 2019) R packages with subsequent model
selection based on forward selection or the AICc information criterion
(e.g., nlme R package; Pinheiro et al., 2011; mumin R package). The
party package allows estimating interactions among two and more
variables using machine learning (Strobl et al., 2009), which can be
used to assess conditional and synergistic effects among variables
(Rillig et al., 2019). For variables expected to exhibit unimodal (e.g.,
pH over a broad gradient) or cumulative (e.g., host richness and
rainfall) relationships, inclusion of second-order polynomials or
fitting generalized additive models (GAMs; mgcv package of R; Pedersen
et al., 2019) may be more appropriate.
For taxon-level analyses such as distribution modelling of OTUs, random
forest and univariate models are appropriate, but nonparametric tests
should be used because of multiple zero-values. In addition, indicator
species analyses can be used as implemented in the indicspecies R
package (Caceres & Legendre, 2009). Specialist and generalist features
of OTUs can be tested in two community types using CLAM, which
outperforms other indicator statistics (Chazdon et al., 2011). Another
option is to decompose multivariate ANOVA models into OTU contributions,
but this works properly only for unifactorial analyses (Ricotta et al.,
2021).
For multivariate analyses, permutational multivariate ANOVA (permanova)
is currently the state of the art in that it outperforms common
ordination methods because of its explicit hypothesis testing (Anderson,
2001). The program PERMANOVA+ included in the Primer6/Primer7 packages
(Anderson et al., 2008) offers more functionalities than the adonis
routine of the vegan package. With respect to ordination methods,
nonmetric multidimensional scaling (NMDS), principal coordinates
analysis (PCoA) and redundancy analysis (RDA) are among the most popular
choices for two-dimensional visualisation of multivariate patterns
(Paliy & Shankar, 2016). In particular, the ordiellipse function in the
vegan package supports analysis and display of within-group variance.
General dissimilarity modelling (GDM) facilitates the testing of
non-linear effects of multiple variables, which helps the user to
identify and understand critical biological thresholds (gdm R package;
Manion et al., 2018). Similar multivariate analyses can also be
performed for phylogenetic and functional data. All multivariate
analyses are sensitive to selection of transformation and distance
(dissimilarity) measures (Kuczynski et al., 2010; Buttigieg & Ramette,
2014)). Hellinger transformation (square-root function) and Bray-Curtis
dissimilarity are among the best choices for quantitative data (Legendre
& Gallagher, 2001). We do not recommend transforming the data to
presences and absences to avoid upweighting rare taxa that may be
artefacts.
Network analysis has become increasingly popular in microbial ecology
(Faust & Raes, 2012; Mikryukov et al., 2021). Unipartite networks
(co-occurrence analyses) are often used to infer positive (mutualism and
facilitation) and negative (avoidance and competition) interactions
among co-occurring taxa (Weiss et al., 2016). Bipartite networks
estimate partner specificity and how these specific and non-specific
taxa are distributed (R package bipartite; Dormann et al., 2009). Both
approaches also provide information on the modularity, nestedness and
connectivity of the studied ecological network. However, both types of
networks have been commonly overinterpreted: unipartite networks may
yield incorrect implications based on compositional data (Rao et al.,
2021), fail to recover negative associations and exhibit a topology that
is sensitive to both network inference techniques and the number of
samples considered (Weiss et al., 2016, Matchado et al., 2021);
occasional presences are commonly misinterpreted as intimate
associations in bipartite networks, particularly for the plant and
root-associated fungal relationships. From a more theoretical
perspective, it has also been shown that spatial co-occurrences poorly
reflect ecological interactions (Blanchet et al., 2020). For addressing
biologically meaningful facilitation and avoidance strategies among
species, the samples should be unpooled and of relevant size to ensure
direct contact among organisms. Submillimeter scale is the most relevant
for assessing fungal interactions with other fungi and bacteria in an
abiotic environment. There are many network construction algorithms
(Weiss et al., 2016, Matchado et al., 2021) for estimating relationship
strength with correlation measures, indices of dissimilarity between
species pairs, proportionality, or measures of conditional dependence
omitting indirect connections or constructing consensus networks based
on different measures (e.g., CoNet; Faust et al., 2016). Some methods
take into account data compositionality and sparsity, for example SparCC
(Friedman & Alm, 2011), SPIEC-EASI (Kurtz et al., 2015) and SPRING
(Yoon et al., 2019)). To reduce the number of false positive
associations, it is recommended to reduce matrix size (rare taxa) to
reach a fill level of ca. 50% (Weiss et al., 2016). The main issue with
comparing networks is related to their lack of true replication (Bahram
et al., 2014) and dependence of results on the linkage metric, filtering
threshold and network construction algorithm (Weiss et al., 2016; Connor
et al., 2017). Nonetheless, consistent associations and integral
topological parameters usually remain unaffected (Toju et al., 2015;
Röttjers & Faust, 2018).
Structural equation modelling (SEM), in particular path analysis, tests
the directionality as well as direct and indirect effects among
variables (Fan et al., 2016; Collier, 2020). These are important to
consider when the explanatory variables affect each other (e.g.,
vegetation and soil) or there are several related response variables
(e.g., richness of different functional guilds; Yang et al., 2021).
Nonetheless, the causal relationship identified by SEMs models strongly
relies on the hypothetical causalities tested, which should hence be
properly justified by other empirical observations or theoretical
foundations. In addition, SEM models have several commonly ignored
assumptions: multivariate normality, linear associations, no missing
data, no multicollinearity and large sample size – at least 20 samples
per variable in the model (Collier, 2020). Taxonomic composition can
also be included in SEMs as principal components (Antoninka et al.,
2009). The program AMOS (www.ibm.com) and the R package piecewiseSEM
(Lefcheck, 2016) offer most functionalities needed for such analyses.
It needs to be remembered that in any ecological tests, the most
biologically informative information to the readers are the effect size
(either Z-score or adjusted R2) as well as direction
and shape of the relationship. Estimates of variance and sample size are
also required for conducting meta-analyses. Values of statistical tests
and P -values are context-dependent albeit necessary to report for
assessing the validity of the analysis. The rise of type I error rate in
multiple testing needs to be accounted for by using flexible correction
methods such as Benjamini-Hochberg correction (Noble, 2009).
Absolute and relative abundance
Absolute abundance of bacteria and fungi (as biomass) is an important
ecological measure, which indicates ecosystem nutrient cycles. In
addition, biomass estimates are important for correctly addressing
differences in composition (see below). As fungal biomass cannot be
directly measured, there are several proxies for its estimation, notably
phospholipid fatty acid (PLFA) profiles, ergosterol abundance, qPCR, and
metagenome composition associated with total DNA measurements (Baldrian
et al., 2013; Bahram et al., 2018; Yang et al., 2020). However,
unicellular fungi are poorly represented in these biomass estimates
because of different analogous compounds (Weete et al., 2010), primer
biases and the general paucity of reference genome data (Tedersoo et
al., 2015). Of the available options, qPCR is the simplest and least
laborious method, and it is expected to correlate the strongest with HTS
data since comparisons are related to the same DNA extracts. All these
methods assume that DNA, sterols and PLFAs are extracted from various
samples at equal efficiency, but this cannot be guaranteed for different
soils, plant species or tissues.
An alternative is to use a spike-in method, which relies on adding
artificial molecules or cells of one or preferably more species that are
not expected to occur in biological samples, before DNA extraction
(Palmer et al., 2018; Rao et al., 2021). For absolute quantification, it
is required that the number of rRNA gene copies of the spike-in taxa is
known. Index switches may disproportionally elevate the estimates of
spike-in abundance in samples with a high DNA content.
Visualisation of results
There are a large number of methods and software tools available for
visualising the statistical results, box plots and scatter plots being
the most commonly used. Violin plots are a specific type of box plots
that indicate distribution of measurements and deviation from normality.
Rarefaction curves (smoothed species accumulation curves) are useful for
graphical comparisons of species evenness and richness but also
evaluating sufficiency of sampling and sequencing depth within and among
samples (Colwell et al., 2004). Heat maps are useful for visualising
large correlation matrices or the results of multiple multifactorial
analyses (e.g., ClustVis web tool; Metsalu & Vilo, 2015), although the
typical lack of within-treatment variation can make their ecological
interpretation challenging. Venn diagrams are useful for showing unique
and shared variation among factors or OTUs across factor levels or
combinations (e.g., venny; Oliveros, 2007), and rank-abundance plots can
provide useful information on what specific taxa underlie certain
ecological patterns. Similarly, overlying environmental vectors on
ordination plots can help to identify the abiotic and biotic variables
that are associated with specific taxon or sample abundances.
The overall taxonomic composition is best visualised in stacked plots,
which makes it easier to display multiple treatments. However, as error
bars are lacking (there is no space), such plots and heat maps are
examples of implicit pseudoreplication. Krona charts (Ondov et al.,
2011) and heat trees (metacoder package of R; Foster et al., 2017)
provide an efficient way of demonstrating the distribution of dominant
taxa by taxonomic ranks (e.g., Nilsson et al., 2017). Because of high
space requirements, the overall view or comparison of up to two levels
of a treatment can be effectively indicated for Krona charts, but
interactive versions can be provided as supplementary items. Heat trees
can handle two samples or a single gradient.
While phylogenetic trees are generally too large for visualising
taxonomic affiliation of OTUs in the main article, these are well suited
to supplementary items. Although family-level and higher-level
phylogenetic relationships cannot always be assessed based on ITS data,
these are suitable for demonstrating rough phylogenetic placement.
Phylogenetic trees make more sense for 18S and 28S rRNA gene data and
are furthermore helpful in detecting artifactual OTUs based on
ultra-long branches or branches with zero length next to a long-branch
taxon (Tedersoo et al., 2020b). Large circular phylogenies squeezed into
ca. 17 cm page width offers limited opportunities for interpretation if
taxon names are unreadable and branch support values are lacking. Using
iTOL (interactive tree of life; https://itol.embl.de/), more than 50,000
taxa in phylogenetic trees can be equipped with large amounts of
metadata for display in supplementary materials (Letunic & Bork, 2021).
Data management
Funding requests should be written to cover not only the field and
laboratory parts, but also data processing, metadata annotation, and
public deposition and dissemination. Fungal metabarcoding data should be
submitted to any public archive such as short read archive (SRA),
European nucleotide archive (ENA) or UNITE. UNITE calculates
100%-similarity OTUs for full-length ITS sequences for each biological
sample that are used for further reference and incorporated in SHs.
Direct submission of representative sequences to INSDc is discouraged,
because these poorly annotated and commonly low-quality data hamper
further analyses. We also ask researchers to upload their community
matrices, metadata and demultiplexing information (if relevant) in
public repositories or supplementary materials, which enables other
users to perform meta-analyses and databasing (Põlme et al., 2020;
Vetrovsky et al., 2020). Figshare (https://figshare.com/) and Dryad
(https://datadryad.org/) are among the most popular choices.
Similarly, scripts used for analyses should be released to secure
reproducibility and potential reuse in other applications. For the sake
of clarity and machine-readability, it is best to use standardised
MIMARKS and MIxS terminology (Yilmaz et al., 2011).
In the materials and methods section of scientific papers, it is
important to document all aspects of the analysis Lindahl et al., 2013).
This is likely to shorten the review process and help reviewers and
readers to evaluate the validity and novelty of the procedures.