2.2 Environmental variables acquisition and filtering
The spatial distribution of species can be significantly influenced by the surrounding habitat environment (Zhao et al., 2023). In this study, we obtained raster data of variable projections from the comprehensive dataset Bio-ORACLE (http://www.bio-oracle.org) and Global Marine Environment Datasets (http://gmed.aucklan-d.ac.nz), considering their biological relevance and data availability. To assess collinearity among candidate predictor variables, scatter collinearity analysis, Pearson’s correlation coefficients, and the variation inflation factor (VIF) were employed (Zuur et al., 2010). Variables with correlation values below 0.8 and VIF values below 10 were retained. Nine predictors were considered based on the collinearity analysis, including mean current velocity, mean salinity, mean temperature, temperature range, mean dissolved oxygen, light at bottom, phytoplankton, water depth, and land distance (Table 1, Fig. 2).
Bio-ORACLE provides future environmental projections for two time periods (2040-2025, 2090-2100) using three atmospheric-ocean general circulation models (AOGCMs: CCSM4, HadGEM2-ES and MIROC5) under four Representative Concentration Pathways (RCPs) emission scenarios: RCP 2.6, RCP 4.5, RCP 6.0 and RCP8.5 (Assis et al., 2018). To reduce uncertainty in environmental projections, we averaged the predictions from the three AOGCMs to represent future climate conditions. RCP 2.6 indicates an optimistic emission level with well-controlled greenhouse gas concentration, while RCP 4.5 and RCP 6.0 represent moderate emission levels. RCP8.5 represents a pessimistic scenario without controlled emissions (Moss et al., 2010). For comparability, we used two RCPs (RCP 2.6 and RCP 8.5) to predict future distributions for the 2050s (2040–2050) and 2100s (2090–2100). Future environmental factors, including current velocity, mean salinity, mean temperature, temperature range, are downloaded from Bio-ORACLE, while the other factors were assumed to remain unchanged in the future (Zhang et al., 2021).
2.3 Model construction and assessment
The modeling program used in this study is SDM based on the biomod2 package in the R platform (Thuiller et al., 2020). The package includes ten modeling algorithms: maximum entropy (Maxent), random forest (RF), surface extent envelope (SRE), multiple adaptive regression splines (MARS), artificial neural network (Moss et al.), flexible discriminant analysis (FDA), classification tree analysis (CTA), generalized boosting model (GBM), generalized linear model (GLM), and generalized additive model (GAM). Species distribution modeling and prediction rely on existing/pseudo-missing records and current environmental data. Due to limited actual sampling point data for most species, pseudo-distribution data is used as a substitute to overcome this limitation (Barbet-Massin et al., 2012). The pseudo-absences function in the R package MOPA is employed to randomly simulate an equal number of pseudo-absent records and compare them with the conditions of actual presence points, thereby improving the model’s predictive performance (Guisan et al., 2017). To evaluate the accuracy of the model predictions, a five-fold cross-validation method with 10 replicates was used (Fu et al., 2021). In this methodology, 80% of the dataset is randomly selected for model calibration and testing, while the remaining 20% is reserved for assessing model predictions.
Two model evaluation indices embedded in biomod2, namely true skill statistics (TSS) and the area under the receiver operating characteristic curve (AUC), are calculated to estimate predictive accuracy. Given the potential diversity in outcomes generated by the ten models, an ensemble projecting approach is adopted to reduce uncertainty and enhance reliability (Buisson et al., 2010; Morato et al., 2020). To improve the accuracy of ensemble models, only models with an AUC greater than 0.8 and a TSS greater than 0.7 are retained, indicating high predicted accuracy and low uncertainty (Allouche et al., 2010; Mei et al., 2017).
Species-level and population-level modeling analyses are conducted, drawing species response curves for each environmental variable using the generated SDMs to visualize the variation in species occurrence probability along the environmental gradient. The significance of each predictor is estimated using a randomly permuted method (Guisan et al., 2017). The potential distributions of the entire species and the two populations (EIOS and WPI) under present and future climate scenarios (2050s, 2100s) for RCP 2.6 and RCP 8.5, respectively, were predicted using the ensemble models at both species and population levels. Continuous habitat suitability maps are created based on the direct outputs from the ensemble models. Using automatically derived thresholds that maximize the TSS values of the ensemble model, we translated the continuous prediction into binary values for a clearer comprehension of habitat appropriateness (Liu et al., 2013).