Population genetic analyses
To evaluate the usefulness and power of GT-seq for population genetic analysis and further extend our understanding of polar bear population structure, we estimated several metrics of genetic diversity and structure using a combined dataset called and filtered with our BCF-6 workflow. This dataset comprised individuals genotyped at 322 autosomal SNPs using GT-seq, as well as individuals previously genotyped at our 322 GT-seq loci using ddRADseq (Jensen et al., 2020). After combining GT-seq and ddRADseq data, we removed all replicates (similarity >0.8, known duplicate subsamples, tissue from tissue-colon sets), and all individuals with >50% missing data. Our final GT-seq+ddRADseq dataset contained a total of 642 individuals genotyped at 322 autosomal loci and 2 sex-linked loci.
For the combined GT-seq+ddRADseq dataset, we estimated observed and expected heterozygosity (Ho/HE) and inbreeding coefficients (Gis) for each subpopulation, as implemented in GENODIVE v2.0b27 (Meirmans & Van Tienderen, 2004). Using the related package in R (Pew, Muir, Wang, & Frasier, 2015), we used our empirical allele frequencies to simulate pairs of individuals with known relatedness. One hundred pairs were simulated for each of the following categories: unrelated individuals, half siblings, full siblings, and parent-offspring pairs. We used this simulated dataset to test the ability of multiple relatedness estimators to distinguish among relatedness categories. Based on the results, we calculated pairwise relatedness between all 642 individuals in our combined dataset.
To assess population substructure, we used Bayesian clustering analysis as implemented in STRUCTURE 2.3.4 (Pritchard, Stephens, & Donnelly, 2000). We evaluated the number of clusters in the data (K) from 1 to 10, with 20 iterations of each, and with a run length of 300,000 MCMC following a burn-in period of 100,000 MCMC. The most likely value of K was identified using the DeltaK method of Evanno, Regnaut, & Goudet (2005) in Structure Harvester (Earl & vonHoldt, 2012). To complement the STRUCTURE analysis, we also used discriminant analysis of principal components (DAPC; Jombart, Devillard, & Balloux, 2010), implemented inadegenet in R (Jombart, & Ahmed, 2011) to evaluate the number of genetic clusters in the data. Principal component analysis was also performed using the R package ade4 v. 1.7-16 (Dray & Dufour, 2007) with a priori groups assigned based on sample type (MS, HV, BP, CF, FF) and data type (GT-seq, ddRADseq) to confirm structuring was independent of sample and data types (Figure S5). The best-fit value of K was tested using the find.clusters function and Bayesian information criterion (BIC), and the chosen value of K was selected based on the lowest BIC value.
To estimate assignment accuracy to each subpopulation and STRUCTURE-derived genetic cluster, we used principal components analyses and Monte Carlo cross-validation procedures implemented in theAssignPOP package in R (Chen et al., 2018). For the subpopulation version of analysis, we removed subpopulations with <10 samples (Norwegian Bay), whereas for the genetic cluster version, we only retained individuals with >0.7 membership to a single genetic cluster. From these data we built a predictive model using a support vector machine (model svm) classification based on training sets composed of the most informative 75% loci and a random sample of 75% of individuals in the dataset. The remaining 25% of individuals were used to test the rate of assignment, which was then averaged across 30 iterations.