Fig. 1 Sampling locations
These sequence data have been submitted to the European Nucleotide Archive (ENA) under accession number PRJEB65323. (not yet published)
2.3 DNA extraction
The genomic DNA was extracted from 40 to 80 mg of dried needles after grinding them for 2 minutes in a FastPrep®-24 instrument (MP BIOMEDICALS) device using the Invisorb® Spin Plant Mini Kit (STRATEC MOLECULAR), DNeasy® Plant Mini Kit (QIAGEN) or DNeasy® Plant Pro Kit (QIAGEN), following the supplier’s protocol, but the buffers were heated before use.
2.4 GBS, de novo assembly, and variant call
Genotyping by sequencing (GBS) was conducted following Wendler et al. (2014) to obtain genome-wide single nucleotide polymorphisms (SNPs). Genomic DNA (200 ng) was digested using restriction enzymes PstI-HF and MspI. PstIHF is a rare-cutting enzyme (recognition site: CTGCA’G) and the methylation-sensitive enzyme MspI (recognition site: C’CGG) is a frequent cutting enzyme. Libraries were sequenced on an Illumina HiSeq 2000 and NovaSeq6000, generating single-end reads of 100-120 bp.
Barcoded Illumina reads were de-multiplexed using the Casava pipeline 1.8 (Illumina), trimmed, filtered, and de novo assembled with ipyrad v. 0.7.30 (Eaton and Overcast 2020).
A maximum of 5 low quality (Q<20) bases were allowed in a read. Consensus base calling was also part of the ipyrad pipeline with a minimum depth set at 6. A threshold of at least 85% sequence similarity was set to identify homologous sequences, and thus cluster together. The minimum number of samples that must have data at a given locus was set to 60%. Heterozygous sites were allowed for a maximum of 50% of the samples.
For additional details, the ipyrad parameter file is provided in Appendix S2. The geno file was used to generate the input file for DIYABC (Appendix S3).
2.5 GBS data analysis
The genetic structure among populations and between individuals has the potential to unravel spatial distribution patterns. Individuals will have more similar genotypes when they originate from the same population. Thus, it is possible to evaluate how they cluster based on their genotypes (Cornuet et al. 1999). Here we use the Bayesian population assignment using the R package LEA to serve as an illustration of the distribution of ancestry coefficients. It represents the spatial predictions on a geographic map. The acquired SNP data are linked to geographical information and patterns are inferred through the analysis of ancestry coefficients and admixture rates (cluster analysis) (Frichot and François 2015). Subsequently, the evaluation of potential biogeographical dynamics of different larch populations in Russia will be performed via the model-based approach Approximate Bayesian Computation, implemented in the software DIYABC (Cornuet et al. 2014).
2.5.1 Spatial distribution of SNPs
We used optimized versions of principal component analysis (PCA) and non-negative matrix factorization algorithms (sNMF) (Frichot et al. 2014) as implemented in the R package LEA version 2.8.0 (Frichot and Francois 2015; Francois 2016). To define the optimal number of genetic clusters (K) we tested K values from 1 to 6 based on knowledge about the number of larch species and main hybrids in Siberia. The final number of clusters (K) was selected by choosing the K-value with the lowest cross-entropy. Only the K with the maximal local contribution to ancestry is represented at each point of the geographic map.
2.5.2 Biogeographic inference - Approximate Bayesian Computation (ABC)
Linkage disequilibrium (LD), Hardy Weinberg equilibrium (HWE), minor allele frequency (MAF), and data missingness (MISS) of the SNP dataset were inferred using the software PLINK (Purcell et al. 2007). SNPs were LD pruned with a window of 50 SNPs, a step size of 20 makers, and r2 threshold 0.05. The SNPs showing severe distortion of the HWE (p < 0.05), or MAF lower than 5%, as well as SNP markers with missing data (MISS) above 20% were discarded from further analysis.
We applied an Approximate Bayesian Computation (ABC) using the software DIYABC v. 2.1.0 (Cornuet et al. 2014). The potential demographic history was inferred via complex evolutionary scenarios. To keep the scenarios in the ABC simple, four populations were defined based on the results of the cluster analysis and the admixture plots: Pop1 (Western Siberia= WSib), Pop2 (Western Yakutia= WYak), Pop3 (Eastern Yakutia= EYak), and Pop4 (Chukotka= Chuk). Alternative scenarios including the predefined four distinct genetic groups for the estimation of the population demographic history were constructed. In these scenarios, t# represents the time scale, measured by generation time, and N# represents the effective population size of the corresponding populations. The 12 examined scenarios viewed backward in time, are as follows (Fig. 2):
Scenario 1: Pop4/ Chuk merged with Pop3/ EYak at t1, and then they merged with Pop2/ WYak at t2. They merged with Pop1/ WSib at t3. Scenario 2: Pop3/ EYak merged with Pop2/ WYak at t1, and then Pop4/ Chuk merged with them at t2. They merged with Pop1/ WSib at t3. Scenario 3: Pop3/ EYak was created by an admixture of Pop2/ Wyak and Pop4/ Chuk at t1, and then Pop4/ Chuk merged with Pop2/ WYak at t2. Pop2 merged with Pop1/ WSib at t3. Scenario 4: Three populations (Pop1/ WSib, Pop3/ EYak, Pop4/ Chuk) split at the same time, namely t1, and then Pop2/ WYak merged with them at t2. Scenario 5: All four populations (Pop1/ WSib, Pop2/ WYak, Pop3/ EYak, Pop4/ Chuk) split at the same time, namely t1. Scenario 6: The order of the hierarchical splits is the same as in scenario 1 but a recent bottleneck was modeled in all four populations at t1-db (db: time scale to bottleneck before t1): population size changed from N1 to N1b, from N2 to N2b, from N3 to N3b, and from N4 to N4b. N4b merged with N3b at t2, and then they merged with N2b at t3. They merged with N1b at t4. Scenarios 7–12: These are the same as scenarios 1–6 but an ancient and severe bottleneck was modeled: population size changed from N1 to N1a (and from N1b to N1a for scenario 12) at t4-db for scenarios 7–9, at t3-db for scenario 10, at t2-db for scenario 11, and at t5-db for scenario 12.