Data processing
Spectra in the range of 2-20 kDa were processed with MALDIquant (Gibb &
Strimmer, 2012) and MALDIquantForeign (Gibb, 2017) using square-root
transformation, Savitzky-Golay smoothing with a half window size of 10,
baseline removal by the statistics-sensitive non-linear iterative
peak-clipping algorithm (SNIP; Ryan et al., 1988), and normalization
to a total ion current of 1. Optimal peak detection
parameters were derived by varying the signal-to-noise ratio (SNR)
threshold for peak identification and the half window size (HWS) for
peak picking, both in the range of 3-15, with the species classification
success of the random forest model (see below) as the target
variable. The highest classification success was reached with an SNR of 4
and an HWS of 3; these values were then used for final peak detection.
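
To illustrate how the two tuned parameters interact, the following minimal Python sketch picks peaks as local maxima within a window of plus or minus HWS data points and keeps only those exceeding SNR times an estimated noise level. It is not the MALDIquant implementation: the function names are hypothetical, and the constant noise estimate based on the scaled median absolute deviation is an assumption made for this example.

```python
from statistics import median


def mad_noise(intensities):
    """Constant noise estimate: scaled median absolute deviation (assumed estimator)."""
    med = median(intensities)
    deviations = [abs(x - med) for x in intensities]
    return 1.4826 * median(deviations)


def detect_peaks(intensities, half_window_size=3, snr=4.0):
    """Return indices of local maxima whose intensity exceeds snr * noise.

    A point counts as a local maximum if no value within
    +/- half_window_size points is larger. Plateaus are not merged
    in this simplified version, so tied maxima are all reported.
    """
    threshold = snr * mad_noise(intensities)
    n = len(intensities)
    peaks = []
    for i, value in enumerate(intensities):
        lo = max(0, i - half_window_size)
        hi = min(n, i + half_window_size + 1)
        if value == max(intensities[lo:hi]) and value > threshold:
            peaks.append(i)
    return peaks
```

In this sketch, a larger HWS suppresses neighbouring maxima, whereas a larger SNR discards low-intensity maxima; this is why the two parameters were varied jointly.
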
Picked peaks were repeatedly binned to compensate for small variations in
the m/z values between measurements until the intensity matrix reached a
stable peak number (tolerance 0.002, strict approach). All signals below
the SNR threshold were set to zero in the final peak matrix. For all further
analyses, peak intensities were Hellinger-transformed (Legendre &
Gallagher, 2001) using the R package vegan (Oksanen et al., 2019), as
this transformation has proved beneficial for proteomic data (Rossel &
Martinez Arbizu, 2018a).
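
For readers unfamiliar with the Hellinger transformation, the following minimal Python sketch shows the computation: each intensity is divided by the total intensity of its spectrum (row), and the square root of that proportion is taken. The analysis itself used the R package vegan; the function name here is hypothetical, and returning zeros for spectra without any peaks is an assumption of this example.

```python
import math


def hellinger_transform(matrix):
    """Hellinger-transform a peak intensity matrix (rows = spectra, columns = peaks).

    Each value becomes sqrt(value / row_total). Rows whose total is zero
    are returned as zeros (a choice made for this sketch).
    """
    transformed = []
    for row in matrix:
        total = sum(row)
        if total == 0:
            transformed.append([0.0 for _ in row])
        else:
            transformed.append([math.sqrt(value / total) for value in row])
    return transformed
```

After the transformation, the squared values of each non-empty row sum to 1, which reduces the influence of a few high-intensity peaks on distance-based and classification analyses.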