2.3 MALDI-TOF MS spectra processing and data modelling
Spectra were acquired using default settings and visualized with
FlexAnalysis software (Bruker Daltonics), where outliers and zero lines
were removed. MALDI-TOF MS spectra were exported to and further
processed with Clover MS Data Analysis software (Clover Biosoft,
Granada, Spain) as follows: a) variance stabilization, b) smoothing with
a Savitzky-Golay filter (window length: 11; polynomial order: 3), c)
baseline subtraction using a Top Hat filter (0.02), and d) Total Ion
Current (TIC) normalization.
Replicated peaks were aligned in the 2,000-20,000 Daltons region of the
spectra, where most bacterial proteins can be found, and then merged into
an average spectrum for each isolate according to the information
compiled in a previous study (Candela et al., 2022).
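The four preprocessing steps listed above can be sketched outside the commercial software as follows. This is a minimal illustration and not the Clover MS Data Analysis implementation: the function name is hypothetical, the square root is assumed as the variance-stabilizing transform, and the Top Hat value of 0.02 is assumed to be the structuring-element width expressed as a fraction of the spectrum length. Only the Savitzky-Golay parameters are taken directly from the text.

```python
import numpy as np
from scipy.ndimage import white_tophat
from scipy.signal import savgol_filter


def preprocess_spectrum(intensity, tophat_fraction=0.02):
    """Apply steps a) to d) to one intensity vector (hypothetical helper)."""
    y = np.asarray(intensity, dtype=float)

    # a) Variance stabilization. ASSUMPTION: square-root transform.
    y = np.sqrt(np.clip(y, 0.0, None))

    # b) Savitzky-Golay smoothing with the parameters stated in the text.
    y = savgol_filter(y, window_length=11, polyorder=3)

    # c) Top Hat baseline subtraction. ASSUMPTION: 0.02 is the window width
    #    as a fraction of the number of points; forced odd so it is centred.
    size = max(3, int(round(tophat_fraction * y.size)) | 1)
    y = white_tophat(y, size=size)

    # d) TIC normalization: divide by the summed intensity.
    tic = y.sum()
    return y / tic if tic > 0 else y
```

After this step every spectrum sums to one, so intensities become comparable between isolates regardless of the absolute signal acquired.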
As a first approach, the mass spectra from the 35 strains characterized
by WGS were used as the training set for data modelling. Two peak
matrices were built: A) using the Threshold method, which consisted of
applying a 0.01 threshold value to the average spectra, thereby retaining
only the peaks above 1.0% of the maximal intensity (Prominence: 0.01;
Distance: 1); and B) using the Biomarker selection method, which searches
for peaks specific to each category (“outbreak” or “control”
P. aeruginosa isolates). Peaks within the 2,000-10,000 Daltons
range and with an area under the curve (AUC) higher than 0.85 were
evaluated and selected for the construction of the biomarker peak
matrix.
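The two selection strategies can be illustrated with the sketch below. It is an approximation under stated assumptions and not the software's own routine: `scipy.signal.find_peaks` stands in for the peak finder, the AUC is computed with a rank-based (Mann-Whitney) estimate, and a peak is assumed to qualify whether it is enriched in the "outbreak" or in the "control" category. All function names are hypothetical.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import rankdata


def threshold_peaks(mz, intensity, threshold=0.01):
    """Method A: keep peaks above `threshold` of the maximal intensity."""
    rel = np.asarray(intensity, dtype=float)
    rel = rel / rel.max()  # scale so that the base peak equals 1.0
    idx, _ = find_peaks(rel, height=threshold, prominence=threshold, distance=1)
    return np.asarray(mz)[idx], rel[idx]


def auc_per_peak(X, is_outbreak):
    """Rank-based ROC AUC of each peak column, 'outbreak' as positive class."""
    X = np.asarray(X, dtype=float)
    pos = np.asarray(is_outbreak, dtype=bool)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    ranks = rankdata(X, axis=0)  # ties receive average ranks
    u = ranks[pos].sum(axis=0) - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def select_biomarkers(mz, X, is_outbreak, mz_range=(2000, 10000), min_auc=0.85):
    """Method B: mask of peaks inside the mass range with AUC above the cutoff."""
    auc = auc_per_peak(X, is_outbreak)
    # ASSUMPTION: a peak may mark either category, so the AUC is symmetrized.
    auc = np.maximum(auc, 1.0 - auc)
    mz = np.asarray(mz, dtype=float)
    return (mz >= mz_range[0]) & (mz <= mz_range[1]) & (auc > min_auc)
```

Here `X` is an isolates-by-peaks intensity matrix whose columns correspond to the entries of `mz`; the returned mask selects the columns that would form the biomarker peak matrix.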
Matrices built using both the Threshold and Biomarker selection methods
were used as input data for training supervised Machine Learning
algorithms, namely Partial Least Squares Discriminant Analysis (PLS-DA),
linear Support Vector Machine (SVM), Random Forest (RF) and Neighborhood
Component Analysis with K-Nearest Neighbors (NCA-KNN), and unsupervised
algorithms, namely Principal Component Analysis (PCA) and Hierarchical
Clustering (HC), as predictive models. Internal validation was performed
for each predictive model by k-fold cross-validation (k=10) as described
previously (Zvezdanova et al., 2022).
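As an illustration of the internal validation, the sketch below runs a 10-fold cross-validation for three of the supervised classifiers using scikit-learn. It is not the procedure implemented in the commercial software. Stratified and shuffled folds, feature standardization, the number of trees and the number of neighbours are all assumptions, and PLS-DA is left out because scikit-learn provides `PLSRegression` but no ready-made discriminant classifier. The function name is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def cross_validate_models(X, y, k=10, seed=0):
    """Return the mean k-fold accuracy of each classifier on a peak matrix."""
    models = {
        "SVM": make_pipeline(StandardScaler(), SVC(kernel="linear")),
        "RF": RandomForestClassifier(n_estimators=100, random_state=seed),
        "NCA-KNN": make_pipeline(
            StandardScaler(),
            NeighborhoodComponentsAnalysis(random_state=seed),
            KNeighborsClassifier(n_neighbors=3),
        ),
    }
    # ASSUMPTION: folds are stratified by category and shuffled once.
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    return {
        name: float(cross_val_score(model, X, y, cv=cv).mean())
        for name, model in models.items()
    }
```

With a training set of 35 isolates, each of the 10 folds holds out only three or four spectra, so the mean accuracy over folds is more informative than any single fold.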
For the external validation of the predictive models described in the
previous paragraph, 32 isolates analyzed by ASO-PCR were included in the
validation set. Average spectra from these isolates were anonymized,
preprocessed as described for the creation of predictive models and used
as input data to predict their category.