2.6 Similarity in gene expression among samples
To assess the variation and direction of variation among samples based
on their gene expression, we calculated the correlation of gene
expression levels among samples and the Euclidean distances among
samples in DESeq2 (version 1.22.2; Love et al. 2014) following the
package documentation. These measures are especially useful to assess the
similarity of biological replicates (e.g., samples belonging to the same
group) (Koch et al. 2018) and therefore to detect anomalies among the
samples. The sample correlation matrix was calculated as the pairwise
Pearson correlation of the normalized matrix obtained after the variance
stabilizing transformation (vst) was applied to the 2000 most variable
genes in the HTSeq count data. The vst accounts for the variability of
low counts across samples.
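The correlation step can be illustrated with a minimal Python sketch. It is not the DESeq2 workflow used here: it assumes an already transformed genes-by-samples matrix, and the names `top_variable_genes`, `sample_correlation` and `vst_matrix`, as well as the simulated values, are hypothetical placeholders.

```python
import numpy as np


def top_variable_genes(expr, n_top=2000):
    """Return the rows (genes) of expr with the highest variance across samples."""
    variances = expr.var(axis=1, ddof=1)
    # Sort gene indices by decreasing variance and keep the first n_top
    keep = np.argsort(variances)[::-1][:n_top]
    return expr[keep, :]


def sample_correlation(expr, n_top=2000):
    """Pairwise Pearson correlation between samples (columns of expr)."""
    subset = top_variable_genes(expr, n_top)
    # rowvar=False treats each column (sample) as one variable
    return np.corrcoef(subset, rowvar=False)


# Hypothetical stand-in for a transformed matrix: 5000 genes x 6 samples
rng = np.random.default_rng(0)
vst_matrix = rng.normal(loc=8.0, scale=1.0, size=(5000, 6))
cor = sample_correlation(vst_matrix)
```

The result is a symmetric samples-by-samples matrix with ones on the diagonal, which is the form typically passed to a heatmap function.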
The Pearson correlation is calculated pairwise between samples and
ranges from -1 to 1: a value of 0 indicates no linear correlation (gene
expression in the two samples is unrelated), a value of 1 indicates that
the expression levels of the two samples are perfectly positively
correlated, and a value of -1 indicates a perfect negative correlation. The
Euclidean distance between sample expression profiles was calculated as
$\mathrm{dist} = \sqrt{1 - \mathrm{cor}^2}$, where cor is the correlation
coefficient of the two samples. The
smaller the distance, the higher the correlation between samples. These
distances were then used to build sample-distance heatmaps for each
normalized matrix, in which the data are shrunken towards each gene’s
average expression across all samples. Gene heatmaps were instead based
on raw counts normalized with the vst, after which the mean expression
in each sample was centered at 0. Finally,
differences in gene expression among the studied groups (see below) were
visualized by a PCA plot of the gene count matrix after the raw counts
were normalized with the variance stabilizing transformation (vst). PCA
plots are useful to assess the effect of covariates and batch effects
(non-biological variation due to experimental artifacts; Reese et al.
2013).
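
The distance and PCA steps described in this section can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the DESeq2 implementation: the distance formula is read as sqrt(1 - cor^2), the input is assumed to be an already transformed genes-by-samples matrix, and `correlation_distance`, `sample_pca` and `vst_matrix` are hypothetical names with simulated values.

```python
import numpy as np


def correlation_distance(cor):
    """Element-wise distance sqrt(1 - cor^2) from a correlation matrix."""
    # Clip at 0 to guard against tiny negative values caused by rounding
    return np.sqrt(np.clip(1.0 - cor ** 2, 0.0, None))


def sample_pca(expr, n_components=2):
    """PCA of samples (columns of expr) via SVD of the gene-centered matrix."""
    x = expr.T                      # samples x genes
    x = x - x.mean(axis=0)          # center each gene across samples
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]   # sample coordinates
    explained = (s ** 2) / np.sum(s ** 2)             # variance fractions
    return scores, explained[:n_components]


# Hypothetical stand-in for a transformed matrix: 500 genes x 6 samples
rng = np.random.default_rng(1)
vst_matrix = rng.normal(loc=8.0, scale=1.0, size=(500, 6))

cor = np.corrcoef(vst_matrix, rowvar=False)
dist = correlation_distance(cor)
scores, explained = sample_pca(vst_matrix)
```

Plotting the first two columns of `scores`, colored by group or batch, gives the kind of PCA plot used to inspect covariate and batch effects.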