Species classification and species-specific markers
Two peak-detection parameters, the half-window size (HWS) on the m/z axis and the signal-to-noise ratio (SNR) on the intensity axis, influenced the success of species classification. The out-of-bag error rate was below 0.6% in the range of SNR 4 to 8 and HWS 4 to 8 (Fig. 2). Peak number decreased with increasing SNR from around 3,000 to a fairly stable 500 from an SNR of 10 onwards. All specimens were correctly identified by the random forest (RF) model at an SNR of 4 and an HWS of 3. These parameter settings were then used for all further analyses.
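To illustrate how the two parameters interact, the following minimal sketch detects peaks as local maxima within a window of 2 x HWS + 1 data points that also exceed SNR times an estimated noise level. It is not the pipeline used in this study: the function names, the constant noise estimate (scaled median absolute deviation) and the toy spectrum are assumptions made for illustration only.

```python
from statistics import median


def mad_noise(intensities):
    """Assumed noise model: one constant level, the scaled median absolute deviation."""
    med = median(intensities)
    return 1.4826 * median(abs(x - med) for x in intensities)


def detect_peaks(mz, intensities, half_window_size=3, snr=4.0):
    """Return (m/z, intensity) pairs that are unique local maxima above snr * noise."""
    noise = mad_noise(intensities)
    n = len(intensities)
    peaks = []
    for i in range(n):
        lo = max(0, i - half_window_size)
        hi = min(n, i + half_window_size + 1)
        window = intensities[lo:hi]
        # A peak must be the single highest point inside its window.
        is_local_max = (intensities[i] == max(window)
                        and window.count(intensities[i]) == 1)
        if is_local_max and intensities[i] > snr * noise:
            peaks.append((mz[i], intensities[i]))
    return peaks


# Toy spectrum (hypothetical values): low baseline with two signals.
intensities = [1, 2, 1, 2, 1, 2, 30, 2, 1, 2, 1, 2, 1, 12, 1, 2, 1, 2, 1, 2]
mz = [2000.0 + 0.5 * i for i in range(len(intensities))]

peaks = detect_peaks(mz, intensities, half_window_size=3, snr=4.0)
```

In this toy case, raising the SNR threshold or widening the window removes the weaker of the two signals, which mirrors the decrease in peak number with increasing SNR described above.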
Overall, 2,418 peaks from all species and specimens were included in the analysis. Peaks per specimen ranged between 163 and 562, with an average of around 300 peaks per individual. To search for ubiquitous compounds and to estimate the specificity and sensitivity of individual peaks, compound occurrence and abundance were analyzed. Only three peaks were present in all species, albeit not in all specimens: m/z 3,920 (in 33% of all specimens), 3,417 (in 22% of all specimens) and 3,065 (in 8% of all specimens). Peaks shared between species (disregarding varying intra-specific peak frequency) showed a bimodal distribution of occurrence, with 70% of all peaks occurring in more than 10 species (Fig. 3A). In total, 398 peaks were observed with a 100% intra-specific frequency, i.e., they occurred in all specimens of a species. Of these, 315 peaks reached 100% frequency in only one species, while the remaining 83 peaks were found in up to six species (Fig. 3B). These peaks were generally of higher intensity than average peaks (Fig. 3C). No peak with 100% frequency was observed in T. longicornis or C. hamatus, only one in A. longiremis and two in C. typicus. These are the species in the data set that were sampled from the most regions and/or from regions with strong environmental variation in salinity and temperature. The peaks with the highest specificity (i.e., the 315 peaks with 100% intra-specific frequency in only one species) were then examined for their occurrence in other species, as a measure of the sensitivity of potential markers. Mean intra-specific frequency of these peaks varied around 25% and maximum frequency around 75% (Fig. 3D). Hence, no single species-specific marker could be identified in the proteomic spectra of the copepods when integrating over seasons, samples and regions.
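The frequency-based screening for candidate markers can be sketched as follows. This is an illustrative reconstruction, not the code used in the study: the function names, the input layout (one set of binned peak masses per specimen) and the toy data are assumptions.

```python
from collections import defaultdict


def intraspecific_frequency(specimens):
    """specimens: list of (species, set of binned peak m/z).

    Returns {species: {peak: fraction of that species' specimens showing the peak}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for species, peaks in specimens:
        totals[species] += 1
        for p in set(peaks):
            counts[species][p] += 1
    return {sp: {p: c / totals[sp] for p, c in counts[sp].items()}
            for sp in totals}


def candidate_markers(freq):
    """Peaks with 100% frequency in exactly one species.

    Maps each such peak to (species, highest frequency in any other species).
    """
    result = {}
    all_peaks = {p for per_species in freq.values() for p in per_species}
    for p in sorted(all_peaks):
        full = [sp for sp in freq if freq[sp].get(p, 0.0) == 1.0]
        if len(full) == 1:
            others = [freq[sp].get(p, 0.0) for sp in freq if sp != full[0]]
            result[p] = (full[0], max(others, default=0.0))
    return result


# Toy data (hypothetical species labels and peak lists).
specimens = [
    ("A", {3920, 5000}),
    ("A", {5000, 6100}),
    ("B", {3920, 7000}),
    ("B", {7000, 5000}),
]
freq = intraspecific_frequency(specimens)
markers = candidate_markers(freq)
```

In the toy data, peak 5000 occurs in every specimen of species A but also in half of the specimens of species B, so it would not qualify as an exclusive marker, whereas peak 7000 is absent from species A.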
Nevertheless, species identification was reliable using random forest. The 170 most important markers, ranked by the class-specific mean decrease in accuracy (i.e., those peaks with high discriminatory power in the nodes of the decision trees), were extracted from the random forest model (Fig. 4). These discriminant peaks were quite evenly distributed over the whole m/z range of 2-11 kDa and included peaks of different intensities. Generally, species-characteristic peaks were of lower importance in species that included specimens from many regions than in species analyzed in only one or two regions.
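The idea behind a class-specific mean decrease in accuracy can be illustrated with a permutation test: the values of one peak are shuffled across specimens, and the resulting drop in per-species accuracy is recorded. The sketch below makes two simplifying assumptions that differ from the study: a nearest-centroid classifier stands in for the random forest, and accuracy is measured on the training data rather than on out-of-bag samples of each tree. All names and data are hypothetical.

```python
import random


def fit_centroids(X, y):
    """Mean feature vector per class (stand-in classifier, not a random forest)."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(centroids, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])))


def class_accuracy(centroids, X, y):
    """Fraction of correctly classified samples, separately for each class."""
    hits, totals = {}, {}
    for row, label in zip(X, y):
        totals[label] = totals.get(label, 0) + 1
        hits[label] = hits.get(label, 0) + (predict(centroids, row) == label)
    return {c: hits[c] / totals[c] for c in totals}


def class_specific_importance(X, y, n_repeats=20, seed=0):
    """Per feature: mean drop in per-class accuracy after shuffling that feature."""
    rng = random.Random(seed)
    centroids = fit_centroids(X, y)
    baseline = class_accuracy(centroids, X, y)
    importance = []
    for j in range(len(X[0])):
        drop = {c: 0.0 for c in baseline}
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = class_accuracy(centroids, X_perm, y)
            for c in baseline:
                drop[c] += (baseline[c] - permuted[c]) / n_repeats
        importance.append(drop)
    return importance


# Toy data: feature 0 separates the two classes, feature 1 carries no class signal.
X = [[0, 1], [1, 2], [0, 2], [1, 1], [10, 1], [9, 2], [10, 2], [9, 1]]
y = ["A", "A", "A", "A", "B", "B", "B", "B"]
importance = class_specific_importance(X, y)
```

Shuffling the discriminating feature lowers the accuracy for both classes, while shuffling the uninformative feature leaves it unchanged. Ranking peaks by this drop, per species, is the principle behind selecting the most important markers.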