2.3 MALDI-TOF MS spectra processing and data modelling
Spectra were acquired using default settings and visualized with FlexAnalysis software (Bruker Daltonics), where outlier and zero-line spectra were removed. MALDI-TOF MS spectra were then exported to and further processed with Clover MS Data Analysis software (Clover Biosoft, Granada, Spain) as follows: a) variance stabilization, b) smoothing with a Savitzky-Golay filter (window length: 11; polynomial order: 3), c) baseline subtraction with a Top Hat filter (0.02), and d) total ion current (TIC) normalization. Peaks from replicate spectra were aligned in the 2,000-20,000 Da region, where most bacterial proteins are found, and then merged into an average spectrum for each isolate, according to the information compiled in a previous study (Candela et al., 2022).
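The four preprocessing steps can be illustrated with the following minimal Python sketch. It is not the Clover MS Data Analysis implementation, whose internals are not described here. It assumes that variance stabilization is a square-root transform and that the Top Hat value of 0.02 is a fraction of the spectrum length; the function name `preprocess_spectrum`, the parameter `tophat_factor` and the synthetic spectrum are hypothetical.

```python
import numpy as np
from scipy.ndimage import white_tophat
from scipy.signal import savgol_filter


def preprocess_spectrum(intensity, window_length=11, polyorder=3, tophat_factor=0.02):
    """Apply steps a) to d) to a single spectrum (1-D array of intensities)."""
    y = np.asarray(intensity, dtype=float)
    # a) Variance stabilization (ASSUMPTION: square-root transform).
    y = np.sqrt(np.clip(y, 0.0, None))
    # b) Savitzky-Golay smoothing (window length 11, polynomial order 3).
    y = savgol_filter(y, window_length=window_length, polyorder=polyorder)
    # c) Top Hat baseline subtraction (ASSUMPTION: the structuring element
    #    spans tophat_factor * number of data points).
    size = max(3, int(round(tophat_factor * y.size)))
    y = white_tophat(y, size=size)
    # d) TIC normalization: divide by the summed intensity.
    tic = y.sum()
    return y / tic if tic > 0 else y


# Synthetic spectrum: constant baseline plus two Gaussian peaks.
mz = np.linspace(2000, 20000, 2000)
raw = (
    50.0
    + 1000.0 * np.exp(-0.5 * ((mz - 5000) / 20) ** 2)
    + 600.0 * np.exp(-0.5 * ((mz - 9000) / 20) ** 2)
)
processed = preprocess_spectrum(raw)
```

In this sketch the processed spectrum keeps the length of the input, contains no negative intensities, and sums to 1, so spectra of different overall intensity become comparable.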
As a first approach, the mass spectra from the 35 strains characterized by WGS were used as the training set for data modelling. Two peak matrices were built: A) using the Threshold method, which consisted of applying a 0.01 threshold value to the average spectra, so that only peaks above 1.0% of the maximal intensity were selected (Prominence: 0.01; Distance: 1); and B) using the Biomarker selection method, which searches for peaks specific to each category (“outbreak” or “control” P. aeruginosa isolates). Peaks within the 2,000-10,000 Da range and with an area under the curve (AUC) higher than 0.85 were evaluated and selected for the construction of the biomarker peak matrix.
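The two selection criteria can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Clover procedure: the AUC is assumed to be the area under the ROC curve for separating the two categories, a peak is treated as discriminative whichever category shows the higher intensity, and the Prominence and Distance settings are not reproduced. All function names are hypothetical.

```python
def threshold_peaks(intensities, threshold=0.01):
    """Indices of local maxima higher than threshold * maximal intensity."""
    cutoff = threshold * max(intensities)
    return [
        i
        for i in range(1, len(intensities) - 1)
        if intensities[i] > cutoff
        and intensities[i] > intensities[i - 1]
        and intensities[i] >= intensities[i + 1]
    ]


def roc_auc(positives, negatives):
    """Probability that a positive value exceeds a negative one (ties count 0.5)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def select_biomarkers(peak_matrix, labels, mz_values, positive="outbreak",
                      min_auc=0.85, mz_range=(2000.0, 10000.0)):
    """Return (m/z, AUC) for peaks inside mz_range with AUC above min_auc."""
    selected = []
    for j, mz in enumerate(mz_values):
        if not mz_range[0] <= mz <= mz_range[1]:
            continue
        pos = [row[j] for row, lab in zip(peak_matrix, labels) if lab == positive]
        neg = [row[j] for row, lab in zip(peak_matrix, labels) if lab != positive]
        auc = roc_auc(pos, neg)
        # ASSUMPTION: a peak specific to either category counts as a biomarker.
        auc = max(auc, 1.0 - auc)
        if auc > min_auc:
            selected.append((mz, auc))
    return selected


# Toy peak matrix: rows are isolates, columns are peak intensities.
demo_mz = [1500.0, 3000.0, 6000.0, 12000.0]
demo_matrix = [[9, 9, 1, 9], [8, 8, 2, 8], [1, 1, 1.5, 1], [2, 2, 1, 2]]
demo_labels = ["outbreak", "outbreak", "control", "control"]
demo_biomarkers = select_biomarkers(demo_matrix, demo_labels, demo_mz)
```

In the toy matrix only the peak at m/z 3,000 is retained: the peaks at m/z 1,500 and 12,000 fall outside the 2,000-10,000 Da window, and the peak at m/z 6,000 does not separate the two categories well enough.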
Matrices built using both the Threshold and Biomarker selection methods were used as input data for supervised machine learning algorithms, namely Partial Least Squares Discriminant Analysis (PLS-DA), linear Support Vector Machine (SVM), Random Forest (RF) and Neighborhood Component Analysis with K-Nearest Neighbors (NCA-KNN), and for unsupervised algorithms, namely Principal Component Analysis (PCA) and Hierarchical Clustering (HC), which served as predictive models. Internal validation of each predictive model was performed by k-fold cross-validation (k=10), as described previously (Zvezdanova et al., 2022).
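As an illustration of the internal validation, the sketch below runs 10-fold cross-validation for three of the supervised algorithms with scikit-learn. It is a sketch under assumptions rather than the Clover workflow: the folds are stratified and shuffled, the hyperparameters are arbitrary, the peak matrix is synthetic (35 samples with a hypothetical 18/17 class split), and PLS-DA is omitted because scikit-learn offers no direct PLS-DA classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def cross_validate_models(X, y, k=10, seed=0):
    """Mean k-fold cross-validation accuracy for each supervised model."""
    models = {
        "SVM": SVC(kernel="linear"),
        "RF": RandomForestClassifier(n_estimators=100, random_state=seed),
        "NCA-KNN": make_pipeline(
            NeighborhoodComponentsAnalysis(random_state=seed),
            KNeighborsClassifier(n_neighbors=3),
        ),
    }
    # ASSUMPTION: stratified, shuffled folds; the text only specifies k=10.
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    return {
        name: float(cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean())
        for name, model in models.items()
    }


# Synthetic peak matrix: 35 isolates x 12 peaks, 1 = "outbreak", 0 = "control".
rng = np.random.default_rng(0)
y_demo = np.array([1] * 18 + [0] * 17)
X_demo = rng.normal(size=(35, 12))
X_demo[y_demo == 1, :3] += 3.0  # three peaks differ between categories
scores = cross_validate_models(X_demo, y_demo)
```

Each value in `scores` is the accuracy averaged over the ten folds. With only 35 spectra every fold holds three or four isolates, so single-fold accuracies are coarse and the averaged figure is the more informative one.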
For external validation of the predictive models described above, 32 isolates analyzed by ASO-PCR formed the validation set. Average spectra from these isolates were anonymized, preprocessed as described for the creation of the predictive models, and used as input data to predict their category.