Species classification and species-specific markers
The two main resolution parameters for peak detection, i.e.,
half-window size (HWS) on the m/z axis and signal-to-noise ratio (SNR)
on the intensity axis, influenced the success of species classification.
The out-of-bag error rate was below 0.6% in the range of SNR 4 to 8 and
HWS 4 to 8 (Fig. 2). The number of peaks decreased with increasing SNR,
from around 3,000 to a fairly stable 500 at SNR values of 10 and above.
All specimens were correctly identified by the random forest (RF) model
at an SNR of 4 and an HWS of 3. These parameter settings were used for
all further analyses.
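How the two parameters interact can be illustrated with a minimal
peak-picking sketch. It is not the peak detection routine used in this
study: the function name detect_peaks, the noise estimate (median
absolute deviation of the intensities) and the toy spectrum are all
assumptions made for illustration. A point is kept as a peak if it is
the maximum within +/- HWS points and its intensity exceeds SNR times
the noise estimate.

```python
from statistics import median


def detect_peaks(intensity, half_window_size, snr):
    """Return indices of local maxima whose intensity exceeds snr * noise.

    Assumption: noise is estimated as the median absolute deviation (MAD)
    of all intensities, scaled by 1.4826 so that it approximates the
    standard deviation for Gaussian noise.
    """
    med = median(intensity)
    noise = 1.4826 * median([abs(y - med) for y in intensity])
    n = len(intensity)
    peaks = []
    for i in range(n):
        # Window of +/- half_window_size points, clipped at the edges.
        lo = max(0, i - half_window_size)
        hi = min(n, i + half_window_size + 1)
        is_local_max = intensity[i] == max(intensity[lo:hi])
        if is_local_max and intensity[i] > snr * noise:
            peaks.append(i)
    return peaks


# Toy spectrum (hypothetical intensities, not data from this study).
spectrum = [1, 2, 1, 2, 30, 2, 1, 3, 1, 2, 1, 12, 1, 2, 1]

low_snr = detect_peaks(spectrum, half_window_size=2, snr=4)
high_snr = detect_peaks(spectrum, half_window_size=2, snr=10)
wide_window = detect_peaks(spectrum, half_window_size=7, snr=1)
```

In this toy case, raising the SNR threshold or widening the window both
reduce the number of retained peaks, which mirrors the direction of the
trend reported above.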
Overall, 2,418 peaks from all species and specimens were included in the
analysis. Peaks per specimen ranged between 163 and 562 with an average
of around 300 peaks per individual. To search for ubiquitous compounds
and to estimate specificity and sensitivity of specific peaks, compound
occurrence and abundance were analyzed. Only three peaks were present in
all species, albeit not in all specimens: m/z 3,920 (in 33% of all
specimens), 3,417 (in 22% of all specimens) and 3,065 (in 8% of all
specimens). Common peaks between species (disregarding varying
intra-specific peak frequency) showed a bimodal distribution of
occurrence, with 70% of all peaks occurring in more than 10 species
(Fig. 3A). In total, 398 peaks were observed with a 100% intra-specific
frequency, i.e., they occurred in all specimens of a species. While 315
of these peaks reached 100% frequency in only one species, 83 peaks did
so in up to six species (Fig. 3B). These peaks were generally of higher
intensity than average peaks (Fig. 3C). No peak with 100% frequency was
observed in T. longicornis and C. hamatus, only one peak in
A. longiremis, and two in C. typicus. These are the species that cover
the most regions in the data set and/or regions with strong
environmental variation in salinity and temperature. Peaks with
the highest specificity (i.e., the 315 peaks with 100% intra-specific
frequency in only one species) were compared for occurrence in other
species, a measure of the sensitivity of potential markers. The mean
intra-specific frequency of these peaks was around 25% and the maximum
frequency around 75% (Fig. 3D). Hence, no single species-specific
marker could be identified in the proteomic spectra of the copepods
when integrating over seasons, samples and regions.
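The frequency bookkeeping described above can be sketched as follows.
This is an illustrative reading of the procedure, not the original
analysis code: the function names, the species labels and the m/z
values are hypothetical, and peaks are assumed to be already binned so
that identical masses compare equal across specimens.

```python
from collections import defaultdict


def intraspecific_frequency(specimens):
    """Return {peak: {species: fraction of specimens showing the peak}}.

    specimens is a list of (species, set_of_peaks) pairs.
    """
    n_per_species = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for species, peaks in specimens:
        n_per_species[species] += 1
        for peak in set(peaks):
            counts[peak][species] += 1
    return {
        peak: {sp: c / n_per_species[sp] for sp, c in by_species.items()}
        for peak, by_species in counts.items()
    }


def fully_conserved(freq):
    """Map each peak to the species in which it occurs in every specimen."""
    result = {}
    for peak, by_species in freq.items():
        species = sorted(sp for sp, f in by_species.items() if f == 1.0)
        if species:
            result[peak] = species
    return result


def off_target_frequency(freq, conserved):
    """For peaks with 100% frequency in exactly one species, return
    their frequency in every other species where they occur."""
    result = {}
    for peak, species in conserved.items():
        if len(species) == 1:
            result[peak] = {
                sp: f for sp, f in freq[peak].items() if sp != species[0]
            }
    return result


# Toy data: two specimens per hypothetical species, made-up m/z bins.
specimens = [
    ("sp_A", {1000, 2000}), ("sp_A", {1000, 3000}),
    ("sp_B", {1000, 4000}), ("sp_B", {4000}),
    ("sp_C", {2000, 3000}), ("sp_C", {3000}),
]

freq = intraspecific_frequency(specimens)
conserved = fully_conserved(freq)
off_target = off_target_frequency(freq, conserved)
```

In the toy data, peaks 1000 and 3000 are present in every specimen of
one species but also appear in half the specimens of another, so only
peak 4000 would qualify as a marker that is both fully conserved and
exclusive to one species.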
Nevertheless, species identification was reliable using random forest.
The 170 most important markers, as ranked by the class-specific mean
decrease in accuracy (i.e., those peaks with high discriminatory power
in the nodes of the decision trees), were extracted from the random
forest model (Fig. 4). These discriminant peaks were fairly evenly
distributed over the whole m/z range of 2-11 kDa and included peaks of
different intensities. Generally, species-characteristic peaks were of lower
importance in species that included specimens from many regions compared
to species analyzed in only one or two regions.
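The idea behind a class-specific mean decrease in accuracy can be
sketched with a permutation test: shuffle one peak across specimens and
record how much the per-species accuracy drops. The sketch below makes
several simplifying assumptions that differ from the actual analysis: a
hand-written stand-in rule replaces the random forest, accuracy is
measured on the full toy data set rather than on out-of-bag samples,
and all names and data are hypothetical.

```python
import random


def class_accuracy(predict, X, y, cls):
    """Fraction of specimens of class cls that predict() labels correctly."""
    idx = [i for i, label in enumerate(y) if label == cls]
    return sum(predict(X[i]) == y[i] for i in idx) / len(idx)


def class_specific_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean per-class accuracy drop after permuting each feature column."""
    rng = random.Random(seed)
    classes = sorted(set(y))
    n_features = len(X[0])
    baseline = {c: class_accuracy(predict, X, y, c) for c in classes}
    importance = {c: [0.0] * n_features for c in classes}
    for j in range(n_features):
        for _ in range(n_repeats):
            # Shuffle feature j across specimens, keep other features.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            for c in classes:
                drop = baseline[c] - class_accuracy(predict, X_perm, y, c)
                importance[c][j] += drop / n_repeats
    return importance


def predict(row):
    """Stand-in classifier (assumption): peak 0 marks sp_A, peak 1 marks
    sp_B, anything else is sp_C. Peak 2 is ignored."""
    if row[0]:
        return "sp_A"
    if row[1]:
        return "sp_B"
    return "sp_C"


# Toy presence/absence matrix: six specimens per hypothetical species,
# with peak 2 acting as uninformative noise.
X, y = [], []
for species, pattern in [("sp_A", [1, 0]), ("sp_B", [0, 1]), ("sp_C", [0, 0])]:
    for k in range(6):
        X.append(pattern + [k % 2])
        y.append(species)

importance = class_specific_importance(predict, X, y)
```

Here peak 1 matters for sp_B and sp_C but has zero importance for sp_A,
because the stand-in rule never consults it once peak 0 is present.
This is the kind of per-class difference that a class-specific
importance measure is meant to expose.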