2.4.2 | Complex admixture model-choice with Random-Forest ABC
For ABC model-choice, we performed 10,000 independent MetHis simulations for each of the nine competing scenarios. To mimic our case-study datasets (see 2.4.3), we simulated 100,000 SNPs and sampled 50 individuals in population H, and 90 and 89 individuals in the African and European source populations, respectively. Using 27 cores and the above design, we performed the 90,000 simulations with MetHis in four days, with two-thirds of that time spent on summary-statistics calculation alone (Supplementary Note S1).
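The simulation budget stated above follows from simple arithmetic, sketched here in Python. The variable names are ours, and the core-day figures assume that all 27 cores were fully used for the whole four days, which the text does not state.

```python
# Simulation budget for the model-choice design (values taken from the text).
N_SCENARIOS = 9
SIMS_PER_SCENARIO = 10_000
N_CORES = 27
WALL_CLOCK_DAYS = 4

# Total number of simulations across all competing scenarios.
total_sims = N_SCENARIOS * SIMS_PER_SCENARIO

# Assumption: every core is busy for the entire run.
core_days = N_CORES * WALL_CLOCK_DAYS

# Two-thirds of the run time went to summary-statistics calculation.
summary_stat_core_days = core_days * 2 / 3
simulation_core_days = core_days - summary_stat_core_days

print(total_sims, core_days, summary_stat_core_days, simulation_core_days)
```

Under this assumption, the summary-statistics step, rather than the simulations themselves, accounts for most of the computational cost.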
We used Random-Forest ABC (RF-ABC) for model-choice, as implemented in the abcrf function of the abcrf package, to obtain the cross-validation table and associated prior error rate using an out-of-bag approach (Figure 2). We considered a uniform prior probability for the nine competing models. We considered 1,000 decision trees in the forest after visually checking, using the err.abcrf function, that error rates converged appropriately (Supplementary Figure S3). RF-ABC cross-validation procedures using groups of scenarios were conducted using the group definition option in the abcrf function (Estoup, Raynal, Verdu, & Marin, 2018). Finally, the relative importance of each summary statistic to the model-choice cross-validation was computed using the abcrf function (Supplementary Figure S4).
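To illustrate how an error rate relates to a cross-validation table, the following Python sketch computes per-model and overall misclassification rates from a confusion matrix. It does not call the abcrf package; the function name and the three-model toy counts are hypothetical and are not results from this study.

```python
def error_rates(confusion):
    """Return (per-model error rates, overall error rate).

    confusion[i][j] counts simulations generated under model i
    that were assigned to model j.
    """
    per_model = []
    correct = 0
    total = 0
    for i, row in enumerate(confusion):
        row_total = sum(row)
        # Fraction of simulations from model i assigned to another model.
        per_model.append(1 - row[i] / row_total)
        correct += row[i]
        total += row_total
    return per_model, 1 - correct / total


# Hypothetical toy table for three models (illustrative counts only).
toy_table = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 5, 5],
]
per_model, overall = error_rates(toy_table)
print(per_model, overall)
```

When every model contributes the same number of simulations, as with a uniform prior over models, the overall rate weights all models equally.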
We explored erroneous model-choice assignment due to model nestedness in the parameter space by considering 1,000 randomly chosen simulations per model as pseudo-observed data (Supplementary Figure S5). We trained the RF algorithm on the 9,000 remaining simulations per model using the abcrf function as above, which provided results highly similar to those obtained with 10,000 simulations per model (results not shown). We then used the predict.abcrf function to perform model choice independently for each of the 1,000 simulated pseudo-observed datasets with known parameter vectors.
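The hold-out design described above can be sketched as follows in Python. This is a minimal illustration, not the procedure actually used: the function name, the seeds, and the use of integer model labels 1 to 9 are assumptions made for the example.

```python
import random


def split_pseudo_observed(n_sims, n_pseudo, seed=0):
    """Split simulation indices 0..n_sims-1 into two disjoint sets:
    pseudo-observed indices and training indices."""
    rng = random.Random(seed)
    indices = list(range(n_sims))
    rng.shuffle(indices)
    return sorted(indices[:n_pseudo]), sorted(indices[n_pseudo:])


# One independent split per model; labels 1..9 are placeholders.
splits = {
    model: split_pseudo_observed(10_000, 1_000, seed=model)
    for model in range(1, 10)
}

# Totals across the nine models.
n_pseudo_total = sum(len(pseudo) for pseudo, _ in splits.values())
n_train_total = sum(len(train) for _, train in splits.values())
print(n_pseudo_total, n_train_total)
```

Keeping the pseudo-observed simulations out of the training set ensures that each one is classified by a forest that has never seen it.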
To empirically evaluate the power of RF-ABC model-choice to distinguish complex admixture processes, we conducted similar cross-validation procedures based on an additional 10,000 simulations per scenario for 50,000 SNPs and, separately, for 10,000 SNPs, instead of 100,000 SNPs (180,000 simulations in total; Supplementary Figure S6A-B).
Furthermore, using 100,000 SNPs, we produced 90,000 simulations and performed cross-validations (Supplementary Figure S6C) considering a five-times smaller sample set, with 10 sampled individuals in population H (instead of 50 as previously) and 18 individuals in each source population (instead of 90 and 89).