DISCUSSION
SCNIC provides a method to measure correlations, find and visualize modules of correlated features, and summarize modules by summing their counts for use in downstream statistical analysis as one method for dimensionality reduction. Using SCNIC with the SMD algorithm for module detection aids in feature reduction in 16S rRNA sequencing data while ensuring a minimum strength of association within modules. As expected, our workflow identified modules in which OTUs tended to be phylogenetically related, especially at relatively high values of R. Using SCNIC, we achieved increased statistical power overall by performing fewer comparisons, but low R-value thresholds had the potential to cause a loss of significance by binning loosely correlated features. In this analysis, we used OTUs as features; however, other microbiome features can be used with SCNIC, such as ASVs, genera, or species defined with a taxonomic classifier, as well as other data types such as metabolome data. SCNIC has also been used in previously published work to perform feature reduction prior to random forest analysis with the microbiome and diverse other data types [56].
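As a minimal illustration of the summarization step described above, the following Python sketch collapses the members of each module into a single feature by summing their per-sample counts. It is not taken from the SCNIC code base; the function name `collapse_modules`, the dictionary layout, and the toy counts are assumptions made for the example.

```python
def collapse_modules(counts, modules):
    """Replace the members of each module with one summed feature.

    counts:  dict mapping feature name -> list of per-sample counts
    modules: dict mapping module name -> list of member feature names
    (Both layouts are hypothetical, chosen only for this sketch.)
    """
    n_samples = len(next(iter(counts.values())))
    in_module = {f for members in modules.values() for f in members}

    # Features that belong to no module are carried over unchanged.
    collapsed = {f: list(row) for f, row in counts.items() if f not in in_module}

    # Each module becomes a single feature: the sum of its members' counts.
    for name, members in modules.items():
        collapsed[name] = [
            sum(counts[f][i] for f in members) for i in range(n_samples)
        ]
    return collapsed


# Toy table: three OTUs measured in three samples.
counts = {
    "otu_a": [10, 0, 5],
    "otu_b": [8, 1, 4],
    "otu_c": [0, 7, 2],
}
modules = {"module-0": ["otu_a", "otu_b"]}

collapsed = collapse_modules(counts, modules)
print(collapsed)
```

In this toy table, three features are reduced to two, so one fewer statistical test is needed downstream.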
SCNIC complements existing methods because these either 1) form correlation networks of microbes for visualization but do not have functionality for selecting and summarizing modules for downstream statistical analysis [32], 2) can select and summarize modules for downstream statistical analysis but are designed for gene expression and not microbiome data [18], 3) only summarize features if they are phylogenetically related [31], or 4) suggest methods for finding modules of correlated microbes but do not provide a convenient implementation [30]. SCNIC is available both as a stand-alone application and as a QIIME 2 plugin for easy integration with existing microbiome workflows.
SCNIC implements both the LMM algorithm, which had been previously recommended for selecting modules of correlated microbes [25, 57], and a novel SMD algorithm. The advantage of the SMD algorithm is that all pairs of features in the module have an R-value greater than the user-provided minimum threshold. Using real and simulated data, we showed that SMD produced smaller modules that generally represent sub-graphs of the larger LMM modules. Since lower R-value thresholds similarly produced larger modules that included more weakly correlated features, we speculate that use of LMM might result in a similar trend of identifying more OTUs within significant modules, but with the disadvantage that individually significant OTUs are lost when they are combined with loosely correlated microbes that are not related to the outcome being tested.
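The pairwise guarantee described above can be made concrete with a short Python sketch. This is a simplified greedy procedure written for illustration under our own assumptions, not the SMD implementation in SCNIC; the names `min_pairwise_r` and `greedy_modules` and the toy correlation values are hypothetical.

```python
from itertools import combinations


def min_pairwise_r(members, r):
    """Smallest pairwise correlation among the given features."""
    return min(r[frozenset(pair)] for pair in combinations(members, 2))


def greedy_modules(features, r, min_r):
    """Greedy sketch: a feature joins a module only if its correlation with
    every current member is greater than min_r (hypothetical simplification)."""
    unassigned = list(features)
    modules = []
    while unassigned:
        module = [unassigned.pop(0)]
        # Iterate over a copy because members are removed from `unassigned`.
        for candidate in list(unassigned):
            if all(r[frozenset((candidate, m))] > min_r for m in module):
                module.append(candidate)
                unassigned.remove(candidate)
        if len(module) > 1:  # singletons are left as individual features
            modules.append(module)
    return modules


# Toy correlations: A-B and B-C are strong, but A-C is weak.
r = {
    frozenset(("otu_a", "otu_b")): 0.8,
    frozenset(("otu_b", "otu_c")): 0.8,
    frozenset(("otu_a", "otu_c")): 0.1,
}

modules = greedy_modules(["otu_a", "otu_b", "otu_c"], r, min_r=0.5)
print(modules)
print(min_pairwise_r(["otu_a", "otu_b", "otu_c"], r))
```

Here otu_c is strongly correlated with otu_b but is kept out of the module because its correlation with otu_a falls below the threshold. A grouping that only follows strong edges through the network could place all three features together, even though the weakest pair has an R-value of only 0.1.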
We illustrate here that varying the R-value threshold input by the user has a great impact on the results. However, we have avoided giving specific R-value threshold recommendations, because optimal R-values may vary across datasets and data types. Using higher R-value thresholds was more likely to identify highly phylogenetically related microbes that likely share overlapping functionality, and in principle could also identify diverse organisms with overlapping niches or highly complementary metabolic functions. Using a lower R-value threshold bins a broader community of more loosely correlated features, with the risk of bringing together features that should not be grouped and losing significance of OTUs, as was illustrated in the Great Lakes dataset analysis conducted here. By summarizing correlated features, SCNIC mitigates overcorrection in multiple-testing adjustment by reducing the number of taxa subject to false discovery rate correction in downstream analysis. When highly similar organisms are grouped into a broader module that is truly independent from other modules, the penalty for testing two highly similar features separately may be avoided.
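The effect of testing fewer features on multiple-testing adjustment can be sketched numerically. The example below assumes the Benjamini-Hochberg procedure as the false discovery rate correction, which is our assumption for illustration and is not specified in this section; the p-values are invented toy numbers, not results from this study.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy scenario: one feature with raw p = 0.004 and unassociated features at p = 0.5.
raw_signal = 0.004
adj_many = benjamini_hochberg([raw_signal] + [0.5] * 19)  # 20 tests
adj_few = benjamini_hochberg([raw_signal] + [0.5] * 4)    # 5 tests after collapsing

print(adj_many[0])  # adjusted p-value of the signal among 20 tests
print(adj_few[0])   # adjusted p-value of the signal among 5 tests
```

With the same raw p-value, the adjusted value is roughly 0.08 among 20 tests and roughly 0.02 among 5 tests, so the feature passes a 0.05 cutoff only in the reduced setting. This gain holds only if collapsing does not dilute the signal by adding unrelated features to the module.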
The results of our HIV dataset analysis confirmed the original findings, as well as those of another study [58], but also included many new significantly associated taxa. SCNIC also assists in interpretation of microbiome data by identifying correlations among these taxa. Our results recapitulated those of the original publication of these data and previous HIV microbiome studies that all found enrichment of Prevotella with MSM status [43, 58-60]. However, our analyses provide additional insight by identifying correlations between differentiating taxa. For instance, in module-0, which was more abundant in MSM samples, OTUs assigned taxonomically to the Prevotella genus are correlated with two OTUs identified as Eubacterium biforme (which has recently been renamed Holdemanella biformis [61]). Prevotella copri has previously been associated with increased inflammation [59], whereas in vitro stimulations of human immune cells have found that P. copri did not induce particularly high levels of inflammation but E. biforme did [60]. This strong correlation between P. copri and E. biforme in MSM could explain the increased inflammation seen in individuals with higher levels of P. copri, with E. biforme being the true driver. Indeed, MSM status has previously been associated with increased inflammation [62, 63]. With the use of SCNIC, this correlation highlighted a possible mechanism that could be followed up functionally in further experimental studies.
SCNIC detected multiple significant modules in which none of the constituent OTUs were significant when analyzed separately. Module-20, which was associated with MSM status, is the fourth most significant feature at an R-value of 0.2 and is made up of Acidaminococcus, Megasphaera, and Mitsuokella species. These are all from the Veillonellaceae family, which likely explains their correlation. Members of the Veillonellaceae family have been linked with inflammation [64].
By increasing statistical power and providing context for the relationships between significant taxa, SCNIC modules open new opportunities for analysis. When a module is associated with a variable of interest, the correlations within the module may imply functional relationships. These can be further investigated with in vitro and in vivo experiments. Studies that aim to test hypotheses generated from correlative analysis will commonly use a single significantly associated microbe. This often does not adequately represent in vivo systems, because microbes in isolation often do not affect a disease state or their environment. SCNIC can enhance these confirmatory studies by identifying groups of microbes that may grow better together than individually and may better elicit relevant phenotypes than when grown separately.