GT-seq genotyping and genotype calling methods
A full description of GT-seq panel optimization can be found inSupplemental Information: Panel Optimization . Using optimized GT-seq conditions and primers for our final GT-seq panel, we prepared two libraries for fecal samples (CF, FF) and another three for tissue samples (HV, BP, MS). The fecal libraries contained all 150 fecal samples, as well as the 21 FF duplicates taken to assess subsample variation in genotyping error. We included 284 tissue samples in the latter libraries and also included two technical replicates (BP) to assess genotyping error for GT-seq. Genotyping error was measured as percent discordance between technical replicates. Within our total of 457 samples, we included 65 paired sets of MS and CF collected from the same individuals. This allowed us to compare GT-seq genotyping error across sample types and obtain a “best case scenario” estimate of genotyping error for FF samples. Here, we assume that the muscle genotype is “correct” and calculate the genotyping error (percent discordance) for CF as a proxy for FF samples. Prior to calculating genotyping error within and among sample types, samples with >50% missing data were removed from the GT-seq dataset, as we considered samples with >50% data (i.e. genotypes for at least half of panel loci) to have been successfully sequenced. To assess the validity of the sex identities provided by our two GT-seq loci, we compared the GT-seq determined polar bear sex to hunter-provided sex identification for 293 samples that were successfully genotyped at >50% loci and for which we had field data.
Library preparation followed the original protocols of Campbell et al. (2015), modified based on our pilot tests (See Supplemental Information: Panel Optimization ), and using only primers for a final optimized panel of 327 SNPs. Libraries were sequenced using an Illumina MiSeq at Queen’s University. We used the GT-seq pipeline available on GitHub (https://github.com/GTseq/GTseq-Pipeline) for filtering to a minimum depth of 10 and genotype calling, as suggested by Campbell et al. (2015). However, we were also interested in examining genotype discordance between different SNP calling models, as some discrepancy has been found between the GT-seq pipeline and other workflows used to call RADseq data (Schmidt et al., 2020). As BCFTOOLS (Li, 2011) is a common genotyping tool and was used to call genotypes for our original ddRADseq dataset (Jensen et al., 2020), we used this same workflow to process our raw GT-seq sequencing reads. Briefly, reads from GT-seq were aligned to the polar bear reference genome (assembly version UrsMar_1.0, PMID: 24813606) using the BWA-MEM v0.7.17 aligner (Li & Durbin, 2009). Alignments were sorted, indexed, and read pairs were fixed using tools from the SAMtcools v1.9 suite (Li et al., 2009). A target file of the GT-seq assay SNP positions was used with BCFtools mpileup to call and produce a VCF file of the targeted sites. The VCF was filtered in VCFtools (Danecek et al., 2011) for a minimum depth cut off of either 6 (dataset BCF-6) or 10 (dataset BCF-10). Output files for each workflow (GT-seq, BCF-6, BCF-10) were converted to genpop format to compare percent missing data by locus. Average mismatch rates for all 457 samples were also calculated between the GT-seq pipeline and the BCFTOOLS workflows. The dataset resulting from the calling method with the least missing data (BCF-6) was used to calculate genotyping error, percent missing data, and for all subsequent population genetic analyses.