Figure Legends
Figure 1: Overview of the included regions. NWA: North-West Atlantic, CAN: Canada, ICE: Icelandic waters, CEA: Central-East Atlantic, MED: Mediterranean, NOS: North Sea, CBS: Central Baltic Sea, NWS: Norwegian Sea, WHS: White Sea, BAR: Barents Sea, OFJ: Oslofjord, GFJ: Gullmarsfjord, BFJ: Balsfjord (see also supplement table 1)
Figure 2: Left panel: impact of peak detection parameters (SNR = signal-to-noise ratio threshold for peak picking, influencing resolution on the intensity axis; HWS = half window size of the peak picking algorithm, influencing resolution on the m/z axis) on species classification success of the random forest model. Right panel: impact of SNR on the number of peaks
Figure 3: A: number of peaks, grouped by number of species with these peaks in common; B: number of peaks, grouped by intra-specific frequency; C: peak intensity as boxplot (without outliers) for all peaks and for peaks with 100% intra-specific frequency; D: max. and mean intra-specific peak frequency of the 315 potential single-markers in other species (100% intra-specific frequency in only one species)
Figure 4: Heatmap of the 170 most important peaks for the species classification random forest model (peaks with a maximum class-specific mean decrease in accuracy of >0.015 are presented). Clustering of species is based on hierarchical clustering (average linkage) of the species-mean Euclidean distance based on the whole peak spectrum; the annotation gives the maximum peak intensity of the given m/z peak over the whole dataset. Heatmap scaling: 0-0.1 class-specific mean decrease in accuracy; peak intensity scaling: 1-7*10^-3 arbitrary units. Species included in this analysis: Acartia bifilosa (Abif), A. clausi (Acla), A. danae (Adan), A. negligens (Aneg), A. longiremis (Alon), A. tonsa (Aton), Calanus finmarchicus (Cfin), C. helgolandicus (Chel), C. glacialis (Cgla), C. hyperboreus (Chyp), Centropages bradyi (Cbra), C. typicus (Ctyp), C. hamatus (Cham), C. chierchiae (Cchi), Metridia longa (Mlon), M. lucens (Mluc), Pseudocalanus elongatus (Pelo), P. moultoni (Pmou), Temora longicornis (Tlon), T. stylifera (Tsty), Paraeuchaeta norvegica (Pnor), Microcalanus sp. (Mcal), Anomalocera patersonii (Apat), Nannocalanus minor (Nmin), Eurytemora affinis (Eaff), Limnocalanus macrurus (Lmac), Corycaeus anglicus (Cang)
Figure 5: Principal Coordinates Analysis (PCoA) on proteomic spectra of congener species, species included: Acartia bifilosa (Abif), A. clausi (Acla), A. danae (Adan), A. negligens (Aneg), A. longiremis (Alon), A. tonsa (Aton), Calanus finmarchicus (Cfin), C. helgolandicus (Chel), C. glacialis (Cgla), C. hyperboreus (Chyp), Centropages bradyi (Cbra), C. kroyeri (Ckro), C. typicus (Ctyp), C. hamatus (Cham), C. chierchiae (Cchi), Metridia longa (Mlon), M. lucens (Mluc), Pseudocalanus elongatus (Pelo), P. moultoni (Pmou), Temora longicornis (Tlon), T. stylifera (Tsty)
Figure 6: Boxplots based on species-specific means (upper panel) and 10 or 90% quantiles (lower panel) of Euclidean distances, providing inter-specific distances and intra-specific distances based on specimens from different regions, from the same region only, and from the same sample, respectively
Figure 7: Heatmaps of Euclidean distance based on the proteomic spectrum of specimens from different regions (annotation; for abbreviations see Fig. 1), hierarchical clustering with average linkage, congener pairs included: Acartia clausi and A. longiremis, Centropages typicus and C. hamatus, Temora stylifera and T. longicornis, Calanus hyperboreus and C. finmarchicus
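
The distance comparisons summarised in Figures 6 and 7 can be illustrated with a minimal Python sketch. It is not the analysis code used for the figures: the function names, the grouping labels and the two-peak toy spectra are hypothetical and were invented purely for illustration. It computes pairwise Euclidean distances between peak-intensity vectors and averages them separately for inter-specific pairs, intra-specific pairs from different regions, and intra-specific pairs from the same region.

```python
import math
from itertools import combinations
from statistics import mean


def euclidean(a, b):
    """Euclidean distance between two equally long intensity vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_summary(specimens):
    """Mean pairwise distance per comparison type.

    specimens: list of (species, region, intensities) tuples.
    The three group labels are illustrative, not taken from the study.
    """
    groups = {
        "inter-specific": [],
        "intra-specific, different regions": [],
        "intra-specific, same region": [],
    }
    for (sp1, reg1, v1), (sp2, reg2, v2) in combinations(specimens, 2):
        d = euclidean(v1, v2)
        if sp1 != sp2:
            groups["inter-specific"].append(d)
        elif reg1 != reg2:
            groups["intra-specific, different regions"].append(d)
        else:
            groups["intra-specific, same region"].append(d)
    # Skip comparison types for which no pair exists.
    return {name: mean(ds) for name, ds in groups.items() if ds}


# Invented toy spectra (two peaks each); not real measurements.
toy_specimens = [
    ("Cfin", "NOS", [0.0, 0.0]),
    ("Cfin", "NOS", [3.0, 4.0]),
    ("Cfin", "BAR", [6.0, 8.0]),
    ("Chyp", "BAR", [0.0, 12.0]),
]

summary = distance_summary(toy_specimens)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With real data, each intensity vector would hold one value per m/z peak of the aligned spectrum, and the per-group distances could be summarised by quantiles instead of means, as in the lower panel of Figure 6.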