GT-seq genotyping and genotype calling methods
A full description of GT-seq panel optimization can be found inSupplemental Information: Panel Optimization . Using optimized
GT-seq conditions and primers for our final GT-seq panel, we prepared
two libraries for fecal samples (CF, FF) and another three for tissue
samples (HV, BP, MS). The fecal libraries contained all 150 fecal
samples, as well as the 21 FF duplicates taken to assess subsample
variation in genotyping error. We included 284 tissue samples in the
latter libraries and also included two technical replicates (BP) to
assess genotyping error for GT-seq. Genotyping error was measured as
percent discordance between technical replicates. Within our total of
457 samples, we included 65 paired sets of MS and CF collected from the
same individuals. This allowed us to compare GT-seq genotyping error
across sample types and obtain a “best case scenario” estimate of
genotyping error for FF samples. Here, we assume that the muscle
genotype is “correct” and calculate the genotyping error (percent
discordance) for CF as a proxy for FF samples. Prior to calculating
genotyping error within and among sample types, samples with
>50% missing data were removed from the GT-seq dataset, as
we considered samples with >50% data (i.e. genotypes for
at least half of panel loci) to have been successfully sequenced. To
assess the validity of the sex identities provided by our two GT-seq
loci, we compared the GT-seq determined polar bear sex to
hunter-provided sex identification for 293 samples that were
successfully genotyped at >50% loci and for which we had
field data.
Library preparation followed the original protocols of Campbell et al.
(2015), modified based on our pilot tests (See Supplemental
Information: Panel Optimization ), and using only primers for a final
optimized panel of 327 SNPs. Libraries were sequenced using an Illumina
MiSeq at Queen’s University. We used the GT-seq pipeline available on
GitHub
(https://github.com/GTseq/GTseq-Pipeline)
for filtering to a minimum depth of 10 and genotype calling, as
suggested by Campbell et al. (2015). However, we were also interested in
examining genotype discordance between different SNP calling models, as
some discrepancy has been found between the GT-seq pipeline and other
workflows used to call RADseq data (Schmidt et al., 2020). As BCFTOOLS
(Li, 2011) is a common genotyping tool and was used to call genotypes
for our original ddRADseq dataset (Jensen et al., 2020), we used this
same workflow to process our raw GT-seq sequencing reads. Briefly, reads
from GT-seq were aligned to the polar bear reference genome (assembly
version UrsMar_1.0, PMID: 24813606) using the BWA-MEM v0.7.17 aligner
(Li & Durbin, 2009). Alignments were sorted, indexed, and read pairs
were fixed using tools from the SAMtcools v1.9 suite (Li et al., 2009).
A target file of the GT-seq assay SNP positions was used with BCFtools
mpileup to call and produce a VCF file of the targeted sites. The VCF
was filtered in VCFtools (Danecek et al., 2011) for a minimum depth cut
off of either 6 (dataset BCF-6) or 10 (dataset BCF-10). Output files for
each workflow (GT-seq, BCF-6, BCF-10) were converted to genpop format to
compare percent missing data by locus. Average mismatch rates for all
457 samples were also calculated between the GT-seq pipeline and the
BCFTOOLS workflows. The dataset resulting from the calling method with
the least missing data (BCF-6) was used to calculate genotyping error,
percent missing data, and for all subsequent population genetic
analyses.