3.1.1 Analysis of outbreak strains characterized by WGS
First, a peak matrix was constructed by applying the Threshold method to the MALDI-TOF MS spectra of the strains previously analyzed by WGS (n=35). The isolates were initially classified according to PFGE results: strains of pulsotype P1 were grouped as outbreak strains, and all other pulsotypes were considered unrelated strains. Cross-validation of this approach (k=10) yielded 97.1% correctly classified isolates with the PLS-DA, RF and NCA-KNN algorithms and 88.6% with SVM (Table S2). In addition, the Biomarker selection method identified three potential biomarkers at 5169, 6915 and 7236 m/z. The resulting peak matrix correctly classified all strains (100%) by internal k-fold validation (k=10) in all prediction models tested (PLS-DA, SVM, RF and NCA-KNN). Unsupervised algorithms also achieved optimal separation of the two main categories (“outbreak” and “control” strains), displaying two well-defined clusters in the PCA plot and the HCA dendrogram (Figure 2).
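To make the reported percentages easier to interpret, the following minimal sketch converts them into isolate counts. It assumes that each of the 35 isolates is predicted exactly once across the 10 folds; the helper name `correct_count` is hypothetical and is not part of the analysis software used in this study.

```python
# Convert cross-validated accuracies (in %) into numbers of isolates.
# Assumption: every one of the 35 isolates is scored exactly once across folds.
N_ISOLATES = 35

# Accuracies reported in the text for the PFGE-based classification (Table S2).
reported_accuracy = {
    "PLS-DA": 97.1,
    "RF": 97.1,
    "NCA-KNN": 97.1,
    "SVM": 88.6,
}


def correct_count(accuracy_pct: float, n: int) -> int:
    """Return the count k (0..n) whose percentage k/n rounds to accuracy_pct."""
    matches = [k for k in range(n + 1) if round(100 * k / n, 1) == accuracy_pct]
    if len(matches) != 1:
        raise ValueError(f"{accuracy_pct}% does not map to a unique count out of {n}")
    return matches[0]


for model, pct in reported_accuracy.items():
    correct = correct_count(pct, N_ISOLATES)
    print(f"{model}: {correct}/{N_ISOLATES} correct, {N_ISOLATES - correct} misclassified")
```

Under this assumption, 97.1% corresponds to 34 of 35 isolates (one misclassified) and 88.6% to 31 of 35 (four misclassified).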
In a second step, MALDI-TOF MS spectra were compared according to WGS clustering, in which the strains assigned to pulsotype 1 (P1) by PFGE were divided into three outbreak groups: Group 1, considered the main outbreak strains, and Groups 2 and 3 (separated from Group 1 by <125 SNPs). The remaining strains were Controls (>5,000 SNPs difference) (Figure 1). This step aimed to differentiate the main outbreak defined by WGS (Group 1) from the rest of the strains (“Controls”, “Group 2” and “Group 3”). For this purpose, a peak matrix was created by applying the Threshold method and used as input data for the PLS-DA, SVM, RF and NCA-KNN algorithms. Correct classification reached 97.1% with SVM (optimized hyperparameter C: 0.01), 91.4% with PLS-DA, 88.5% with NCA-KNN (optimized number of neighbors: 3) and 85.7% with RF (optimized number of estimators: 100) (Table S3; Figure S2). Group 2 strains (n=2) appeared closer to the outbreak strains than Group 3 and control strains (Figure 2C), consistent with their smaller SNP distance from the Group 1 strains (50 SNPs).