BACKGROUND
Microbial communities play important roles in environmental and human health systems and can often reach great complexity. In these rich ecosystems, microbes interact with each other, forming relationships based on predator-prey dynamics [1], competition for resources [2], cross-feeding of small compounds, [3] and other factors. Identifying correlated pairs of microbes can suggest potential interactions or shared environmental preferences. Accordingly, studies have identified complex networks of co-occurring microbes in a variety of different environments ranging from the human mouth and gut [4] to soil [5] and stream ecosystems [6].
To detect correlations between microbes, a variety of methods have been developed. While traditional correlation metrics are used by some [7-9], newer methods have been developed that take into account the properties of 16S rRNA sequencing data [10-12]. A recent review tested these methods on a variety of models and identified some methods that performed better than others in ways that can depend on underlying data characteristics [13]. Although these tools are useful for finding pairwise relationships between organisms, less attention has been given toward developing methods for finding correlations among groups of microbes.
One way to explore complex interactions is to form networks in which correlated organisms are joined with an edge, and highly correlated sets of microbes are defined. Here, we refer to these sets as modules, which are synonymous to clusters or groups. There are two primary benefits of finding modules of correlated microbes. First, the combination of microbes in a module could be further explored to understand microbial interactions, such as cross-feeding relationships, or shared environmental niches [5, 14-16]. Second, considering correlation structure among microbes can aid in statistical analysis aimed at uncovering relationships between microbes and other environmental factors. Specifically, by eliminating or summarizing highly correlated features, dependence between features is decreased. Feature reduction will increase accuracy of methods that assume the independence of features such as false discovery rate technique (FDR) measurements like the Benjamini-Hochberg Correction [17], and statistical power is increased by reducing the number of feature comparisons.
One workflow for considering groups of correlated microbes in downstream statistical analyses requires three steps: first, correlations between microbes must be measured and used to form a network; second, modules must be identified; and third, abundance of the microbes in modules must be summarized for use in subsequent statistical analyses. One software tool that has implemented this workflow, developed for application to gene expression data, is weighted gene correlation network analysis (WGCNA)[18]. WGCNA builds correlation networks based on a correlation coefficient (such as Pearson, Spearman, or biweight midcorrelation [19]), and detects modules as subtrees in a hierarchical cluster of features [20]. Modules are summarized by setting module abundance to that of network hubs or an eigenvector of the abundance of all module members [18].
Several groups have used WGCNA to find correlations within 16S rRNA sequencing data [21-24], but this approach may not be appropriate for several reasons [25]. First, the correlation metrics implemented in WGCNA do not account for sparsity and compositionality. Most sequencing-based microbiome datasets are sparse (i.e. there are many zeros) and compositional, meaning they only carry information on relative abundances of taxa instead of absolute abundances, which can lead to the detection of spurious correlations if proper statistical methods are not used [26]. Thus, the use of WGCNA for compositional data may be leading to the detection of spurious edges in microbiome networks. Second, the primary method WGCNA uses to pick modules assumes the correlation network will have a scale-free topology that may not be relevant to microbiome data [27]. Third, summarizing modules through identifying hub taxa works well in gene expression where a single transcription factor can control the expression of many genes, but may not be appropriate in microbial communities. Both the hub and eigenvector approaches to module summarization do not allow for output tables that maintain the total counts of microbial abundance per sample. Therefore, the hub and eigenvector approaches cannot be used with tools developed for microbiome data analysis that make assumptions based on total sample counts, such as ANCOM [28] or metagenomeSeq [29].
Optimal methods for identifying and summarizing modules of correlated features in 16S rRNA sequencing data have not been deeply explored. One study [25] recommended an ensemble approach for correlation detection, and the Louvain modularity maximization (LMM) method [30] to identify modules. LULU is a tool that follows a binning approach towards OTUs that co-occur, but only does so if they are highly phylogenetically related [31]. Another tool, CoNet, uses an ensemble approach to build and visualize networks [32]. However, no implementation of module summarization was made available for downstream statistical analysis.
To address these gaps, we have developed a tool for sparse, compositional correlation network investigation for compositional data (SCNIC), which uses methods optimized for microbiome data analysis. SCNIC is available as standalone Python software, via Bioconda [33] and the package installer for Python (pip), and as a QIIME 2 plugin [34]. The source code for SCNIC and the QIIME 2 plugin is freely available on GitHub (https://github.com/lozuponelab/SCNIC , https://github.com/lozuponelab/q2-SCNIC) under the BSD-3-Clause License.