The SCNIC method
SCNIC takes a feature table containing counts of each feature in all
samples as input and performs three steps: 1) a correlation network is
built, 2) modules are detected in the network and 3) feature counts
within a module are summed into a single new feature (identified as
“module-x”, where x is a whole number assigned consecutively starting at
zero) (Figure 1). Modules are ordered by size, so lower-numbered modules
have more members than higher-numbered modules. To summarize a module,
SCNIC sums the count data from all features in that module. There is no
maximum or minimum constraint on module size when modules are created.
The newly generated
modules are included in a new feature table alongside all features not
grouped into a module. This maintains the total counts per sample,
allowing for downstream analyses with tools that have assumptions
related to total sample counts. SCNIC produces three outputs (Figure 1):
a graph modeling language (GML) format [35] file compatible with
Cytoscape [36] for network visualization, in which the edges represent
positive correlations stronger than a user-specified R-value cutoff
(between 0 and 1); a file describing which features compose each defined
module; and a feature table in the Biological Observation Matrix (BIOM)
format [37].
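The collapsing step can be illustrated with a minimal sketch. It is not SCNIC's own code: the dictionary-based table, the feature names and the helper functions `collapse_modules` and `sample_totals` are assumptions made for illustration only. It shows that summing module members while carrying over ungrouped features leaves the per-sample totals unchanged.

```python
def collapse_modules(table, modules):
    """Sum feature counts within each module.

    table:   {feature_id: [count in sample 0, count in sample 1, ...]}
    modules: list of lists of feature ids (hypothetical input format)
    """
    # Larger modules get lower numbers, mirroring the ordering described above.
    ordered = sorted(modules, key=len, reverse=True)
    n_samples = len(next(iter(table.values())))
    collapsed = {}
    grouped = set()
    for i, members in enumerate(ordered):
        sums = [0] * n_samples
        for feature in members:
            for s, count in enumerate(table[feature]):
                sums[s] += count
            grouped.add(feature)
        collapsed[f"module-{i}"] = sums
    # Features outside every module are carried over unchanged.
    for feature, counts in table.items():
        if feature not in grouped:
            collapsed[feature] = list(counts)
    return collapsed


def sample_totals(table):
    """Total counts per sample (column sums)."""
    return [sum(column) for column in zip(*table.values())]


# Toy feature table: six features (rows) by three samples (columns).
table = {
    "otu_a": [10, 0, 5],
    "otu_b": [3, 1, 4],
    "otu_c": [7, 2, 0],
    "otu_d": [0, 8, 1],
    "otu_e": [2, 2, 2],
    "otu_f": [1, 1, 1],  # not assigned to any module
}
modules = [["otu_d", "otu_e"], ["otu_a", "otu_b", "otu_c"]]

collapsed = collapse_modules(table, modules)
totals_before = sample_totals(table)
totals_after = sample_totals(collapsed)
```

In this toy table the three-member module becomes module-0, the two-member module becomes module-1, and `totals_before` equals `totals_after`.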
SCNIC allows users to choose between multiple methods for detecting
correlations and for defining modules of co-occurring microbes. For
correlations, SCNIC can use traditional correlation metrics
(including Pearson’s r, Spearman’s ρ and Kendall’s τ) or
the compositionality- and sparsity-aware correlation metric from SparCC
[38, 39], which accounts for these properties of microbiome data. SparCC has been
shown to perform well in detecting correlations compared to other
correlation measures [13]. Specifically, SparCC performs well in
communities with an inverse Simpson index above 13 (which would be
indicative of a high number of successful species, a complex food web,
and many ecological niches, as would be seen in many high biomass
microbial communities such as gut or soil microbiomes) [39, 40], and
it was therefore chosen as the default metric.
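For reference, the inverse Simpson index of a sample is the reciprocal of the sum of squared relative abundances. The sketch below computes it for two toy count vectors; the function name `inverse_simpson` and the example counts are illustrative assumptions, and only the threshold of 13 comes from the text above.

```python
def inverse_simpson(counts):
    """Inverse Simpson index: 1 / sum(p_i ** 2) over relative abundances p_i."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no counts")
    proportions = [c / total for c in counts]
    return 1.0 / sum(p * p for p in proportions)


# Twenty equally abundant features: the index equals the number of features.
even_sample = [5] * 20
# One dominant feature plus ten rare ones: the index falls close to 1.
uneven_sample = [90] + [1] * 10

even_index = inverse_simpson(even_sample)
uneven_index = inverse_simpson(uneven_sample)

# Compare against the threshold of 13 mentioned in the text.
even_above_threshold = even_index > 13
uneven_above_threshold = uneven_index > 13
```

The even sample lies above the threshold and the dominated sample lies well below it, which is the sense in which a high index reflects many successful species.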
To define modules of co-correlated features, we implement two methods:
1) Louvain modularity maximization (LMM) and 2) a novel shared minimum
distance (SMD) module detection algorithm; unlike WGCNA, neither of
these algorithms makes assumptions about network topology. LMM was
previously proposed as a method for clustering correlation networks of
microbes into modules [30]. LMM works by first assigning each feature
to its own module. Each pair of adjacent modules is then tentatively
joined, and the change in modularity (defined by the number of edges
within modules compared to the number between modules) is calculated for
each pair. The pair whose joining increases the modularity of the
network the most is then merged. This process is repeated until the
modularity of the network no longer increases. LMM uses two parameters
provided by the user. The first parameter, R-value,
defines the minimum correlation coefficient for defining an edge between
features. The second parameter, gamma (also referred to as resolution),
controls the size of modules detected, with larger gamma values yielding
larger modules.
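To make the quantity being maximized concrete, the following sketch builds an unweighted network from a toy set of correlations using an R-value cutoff and scores two candidate partitions with the standard Newman modularity. It is an illustration under stated assumptions rather than SCNIC's implementation: it omits the gamma (resolution) parameter, it does not perform the optimization itself, and the helper names and correlation values are hypothetical.

```python
def threshold_edges(corr, min_r):
    """Keep one edge per feature pair whose correlation exceeds min_r."""
    return [pair for pair, r in corr.items() if r > min_r]


def modularity(edges, membership):
    """Newman modularity of a partition of an unweighted, undirected graph."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    q = 0.0
    for module in set(membership.values()):
        # Edges with both ends inside this module.
        inside = sum(
            1 for u, v in edges
            if membership[u] == module and membership[v] == module
        )
        # Edge ends (summed degree) attached to nodes of this module.
        ends = sum(d for node, d in degree.items() if membership[node] == module)
        q += inside / m - (ends / (2 * m)) ** 2
    return q


# Toy correlations: two tight triangles joined by one weaker link (c-d).
corr = {
    ("a", "b"): 0.80, ("b", "c"): 0.70, ("a", "c"): 0.75,
    ("d", "e"): 0.90, ("e", "f"): 0.60, ("d", "f"): 0.65,
    ("c", "d"): 0.50, ("a", "f"): 0.10, ("b", "e"): -0.40,
}
edges = threshold_edges(corr, min_r=0.35)

two_modules = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_module = {node: 0 for node in two_modules}

q_two = modularity(edges, two_modules)
q_one = modularity(edges, one_module)
```

Splitting the network into its two triangles scores higher than lumping all features together, so a modularity-maximizing procedure would prefer the two-module partition here.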
WGCNA and LMM have a potential weakness in that modules can contain
pairs of taxa that are not strongly correlated (e.g. if they are several
steps away from each other in the network). To address this weakness, we
also implement the SMD method to ensure that correlations between all
pairs of features in the module have an R-value greater than the
user-provided minimum (Figure 2). Specifically, the SMD method defines
modules by first applying complete linkage hierarchical clustering to
correlation coefficients to make a tree of features. Next, SMD defines
modules as subtrees where correlations between all pairs of tips have an
R-value above the specified value. SMD is the default method in SCNIC
because it has the desirable property of producing only modules in which
every pair of features is correlated above the user-specified
threshold.
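The idea behind SMD can be sketched in a few lines of plain Python. This is a simplified illustration, not SCNIC's code: it performs complete-linkage agglomeration directly on correlations and stops once the next merge would bring together a pair at or below the cutoff, which for complete linkage gives the same groups as cutting the full tree at that height. The function name `smd_modules`, the input format and the toy correlations are assumptions.

```python
def smd_modules(features, corr, min_r):
    """Group features so every within-module pair has correlation > min_r.

    corr maps frozenset({a, b}) -> correlation (hypothetical input format).
    """
    clusters = [[f] for f in features]

    def weakest_link(c1, c2):
        # Complete linkage: judge a merge by its least-correlated cross pair.
        return min(corr[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                r = weakest_link(clusters[i], clusters[j])
                if best is None or r > best[0]:
                    best = (r, i, j)
        r, i, j = best
        if r <= min_r:
            break  # every remaining merge would include a too-weak pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    # Singletons are not modules; larger modules get lower numbers.
    modules = sorted((c for c in clusters if len(c) > 1), key=len, reverse=True)
    return {f"module-{k}": sorted(members) for k, members in enumerate(modules)}


pairs = {
    ("a", "b"): 0.90, ("a", "c"): 0.80, ("b", "c"): 0.85,
    ("c", "d"): 0.70, ("a", "d"): 0.10, ("b", "d"): 0.20,
    ("d", "e"): 0.60, ("a", "e"): 0.00, ("b", "e"): 0.10,
    ("c", "e"): 0.05,
}
corr = {frozenset(pair): r for pair, r in pairs.items()}
features = ["a", "b", "c", "d", "e"]

modules_050 = smd_modules(features, corr, min_r=0.5)
modules_065 = smd_modules(features, corr, min_r=0.65)
```

With a cutoff of 0.5, features a, b and c form one module and d and e form another. Although c and d are correlated at 0.7 and would share an edge in the network, they are kept apart because a and d (0.1) fall below the cutoff. Raising the cutoff to 0.65 dissolves the d and e module and leaves only the first.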
A large proportion of microbiome studies sample highly uneven
communities, which leads to strong compositionality-driven artifacts
[26, 40, 41]. Because of this, we use SparCC, specifically the
FastSpar implementation [39], as the default correlation measure.
SparCC was chosen based on an analysis suggesting high precision in
recovering correct edges when correlations were calculated on synthetic
data [13]. SCNIC additionally includes the option of using Pearson’s r,
Spearman’s ρ and Kendall’s τ to evaluate non-compositional or dense data
types.