DISCUSSION
SCNIC provides a method to measure correlations, find and visualize
modules of correlated features, and summarize modules by summing their
counts for use in downstream statistical analysis as one method for
dimensionality reduction. Using SCNIC with the SMD algorithm for module
detection aids in feature reduction in 16S rRNA sequencing data while
ensuring a minimum strength of association within modules. As expected,
our workflow identified modules in which OTUs tended to be
phylogenetically related, especially at relatively high values of R.
Overall, using SCNIC increased statistical power by reducing the number
of comparisons performed, but use of low R-value thresholds had the
potential to lead to loss of significance by binning loosely correlated
features. In this analysis, we used OTUs as features; however, other
microbiome features can be used with SCNIC, such as ASVs, genera, or
species defined with a taxonomic classifier, as well as other data types
such as metabolome data. SCNIC has also been used in previously
published work to perform feature reduction prior to random forest
analysis with the microbiome and diverse other data types [56].
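The summarization step described above (summing the counts of a module's members) can be sketched as follows. This is a minimal illustration, not SCNIC's own code: the function name `collapse_modules`, the dictionary-based table layout, and the toy feature names and counts are all assumptions made for the example.

```python
def collapse_modules(counts, modules):
    """Sum per-sample counts of the features in each module.

    counts  : dict mapping feature name -> list of per-sample counts
    modules : dict mapping module name -> list of member feature names
    Features that belong to no module are carried over unchanged.
    (Hypothetical helper; the data layout is an assumption.)
    """
    n_samples = len(next(iter(counts.values())))
    # Reverse lookup: which module, if any, does each feature belong to?
    member_to_module = {f: m for m, feats in modules.items() for f in feats}
    collapsed = {}
    for feature, row in counts.items():
        key = member_to_module.get(feature, feature)
        running = collapsed.get(key, [0] * n_samples)
        collapsed[key] = [a + b for a, b in zip(running, row)]
    return collapsed


# Toy table: three features across three samples (made-up numbers).
counts = {
    "otu_a": [10, 0, 5],
    "otu_b": [3, 1, 2],
    "otu_c": [0, 7, 7],
}
modules = {"module_0": ["otu_a", "otu_b"]}
collapsed = collapse_modules(counts, modules)
print(collapsed)
```

The collapsed table has one row per module plus one row per unmodularized feature, so any downstream test is run on fewer features than the original table.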
SCNIC complements existing methods because these either: 1) form
correlation networks of microbes for visualization but do not have
functionality for selecting and summarizing modules for downstream
statistical analysis [32], 2) can select and summarize modules for
downstream statistical analysis but are designed for gene expression and
not microbiome data [18], 3) only summarize features if they are
phylogenetically related [31], or 4) suggest methods for finding modules
of correlated microbes but do not provide a convenient
implementation [30]. SCNIC is available both as a stand-alone
application and as a QIIME 2 plugin for easy integration with existing
microbiome workflows.
SCNIC implements both the LMM algorithm, which had been previously
recommended for selecting modules of correlated microbes [25, 57],
and a novel SMD algorithm. The advantage of the SMD algorithm is that
all pairs of features in the module have an R-value greater than the
user-provided minimum threshold. Using real and simulated data, we
showed that SMD produced smaller modules that generally represent
sub-graphs of the larger LMM modules. Since the use of lower R-value
thresholds similarly produced larger modules including more weakly
correlated features, we speculate that use of LMM might result in a
similar trend of identifying more OTUs within significant modules, but
with the disadvantage of individually significant OTUs being lost
because they are combined with loosely correlated microbes that are not
related to the outcome being tested.
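The pairwise guarantee described above can be illustrated with a small sketch. The code below is not SCNIC's implementation: it assumes a complete-linkage agglomerative merge rule as one way to ensure that every pair of features in a module meets the threshold, it treats the threshold as inclusive, and the feature names and correlation values are invented for illustration.

```python
from itertools import combinations


def min_pairwise_r(module, r):
    """Weakest correlation between any two members of a module."""
    return min(r[frozenset(pair)] for pair in combinations(module, 2))


def complete_linkage_modules(features, r, min_r):
    """Greedy agglomerative grouping with a complete-linkage rule.

    Two clusters are merged only if the weakest correlation between any
    cross-cluster pair is >= min_r, so every pair inside a resulting
    module satisfies the threshold. (Illustrative assumption, not the
    SCNIC source code.)
    """
    clusters = [[f] for f in features]
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            weakest = min(
                r[frozenset((a, b))] for a in clusters[i] for b in clusters[j]
            )
            if weakest >= min_r and (best is None or weakest > best[0]):
                best = (weakest, i, j)
        if best is None:
            break
        _, i, j = best  # i < j, so deleting index j leaves index i valid
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # Singletons are not modules.
    return [c for c in clusters if len(c) > 1]


# Toy correlations (made up). "a" and "c" are strongly correlated, but
# "b" and "c" are not, so "c" cannot join the module containing "a" and "b".
pairs = {
    ("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.3,
    ("c", "d"): 0.85, ("a", "d"): 0.1, ("b", "d"): 0.2,
}
r = {frozenset(k): v for k, v in pairs.items()}
found = complete_linkage_modules(["a", "b", "c", "d"], r, min_r=0.5)
print(found)
```

In this toy example, a grouping based only on connectivity at the same threshold would chain all four features together through the a-c edge, even though several pairs are only weakly correlated; the complete-linkage rule keeps them in two smaller modules.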
We illustrate here that varying the R-value threshold input by the user
has a great impact on the results. However, we have avoided giving
specific R-value threshold recommendations here, because optimal
R-values may vary across datasets and data types. Using higher R-value
thresholds was more likely to identify highly phylogenetically related
microbes that likely share overlapping functionality, and in principle
could also identify diverse organisms with overlapping niches or highly
complementary metabolic functions. Using a lower R-value threshold bins
a broader community of more loosely correlated features with the risk of
bringing together features that should not be grouped and losing
significance of OTUs, as was illustrated in the Great Lakes dataset
analysis conducted here. By summarizing correlated features, SCNIC
reduces the number of features tested in downstream analysis and thereby
mitigates overcorrection in multiple-testing (false discovery rate)
adjustments. When highly correlated organisms are grouped into a broader
module that is truly independent from other modules, the penalty
otherwise incurred by testing two highly similar features separately may
be avoided in statistical analysis.
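The effect of the number of tests on multiple-testing adjustment can be made concrete with a small sketch. The code below implements the standard Benjamini-Hochberg adjustment in plain Python; whether this particular procedure matches the one used in the analyses here is an assumption, and the p-values are invented purely to show the arithmetic. It isolates only the effect of running fewer tests; in practice the p-value of a summed module will also differ from those of its individual members.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values: one small p-value tested alongside 9 others ...
many_tests = benjamini_hochberg([0.01] + [0.5] * 9)
# ... versus the same small p-value tested alongside only 3 others,
# as might happen after correlated features are collapsed into modules.
few_tests = benjamini_hochberg([0.01] + [0.5] * 3)
print(many_tests[0], few_tests[0])
```

With ten tests the smallest p-value is multiplied by 10, and with four tests by 4, so the same raw p-value yields a smaller adjusted value when fewer features are tested.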
The results of our HIV dataset analysis confirm the original findings, as
well as those of another study [58], but also include many new
significantly associated taxa. SCNIC also assists in interpretation of
microbiome data by identifying correlations among these taxa. Our
results recapitulated those of the original publication of these data
and previous HIV microbiome studies that all found enrichment of
Prevotella with MSM status [43, 58-60]. However, our analyses
provide additional insight by identifying correlations between
differentiating taxa. For instance, in module-0, which was more
abundant in MSM samples, OTUs assigned taxonomically to the
Prevotella genus are correlated with two OTUs identified as
Eubacterium biforme (which has recently been renamed
Holdemanella biformis [61]). Prevotella copri has
previously been associated with increased inflammation [59], while
in vitro stimulations of human immune cells have found that
P. copri did not induce particularly high levels of inflammation
but E. biforme did [60]. This strong correlation between
P. copri and E. biforme in MSM could explain the increased
inflammation seen in individuals with higher levels of P. copri,
with E. biforme being the true driver. Indeed, MSM status has
previously been associated with increased inflammation [62, 63].
By revealing this correlation, SCNIC highlighted a mechanistic
hypothesis that could be followed up functionally in further
experimental studies.
SCNIC detected multiple significant modules in which none of the member
OTUs were significant when analyzed separately. Module-20,
which was associated with MSM status, is the fourth most significant
feature at an R-value of 0.2, and is made up of Acidaminococcus,
Megasphaera, and Mitsuokella species. These genera all belong to
the Veillonellaceae family, which likely explains
their correlation. Members of the Veillonellaceae family have
been linked with inflammation [64].
By increasing statistical power and providing context for the
relationships between significant taxa, SCNIC modules open new
opportunities for analysis. When a module is associated with a variable
of interest, the correlations within the module may imply functional
relationships. These can be further investigated with in vitro
and in vivo experiments. Studies which aim to test hypotheses
generated from correlative analysis will commonly use a single
significantly associated microbe. This often does not adequately
represent in vivo systems because microbes in isolation often do
not affect a disease state or their environment. SCNIC can enhance these
confirmatory studies by identifying groups of microbes that may grow
better together, and may better elicit relevant phenotypes, than when
grown separately.