Low-coverage WGS for population assignment
In addition to elucidating fine-scale migratory connectivity patterns in the American Redstart, our results provide important considerations for other population assignment studies using lcWGS. We found that balancing the effective sample sizes of the source populations to within one effective individual of each other was essential for accurate assignment. Even when the actual number of individuals used per population was the same, variation in mean depth (1.3X – 1.9X) between populations skewed the effective sample sizes, resulting in decreased assignment accuracy. Other studies with known genotypes from RADseq have demonstrated the influence of actual sample size on overall assignment accuracy but not how it affects assignment bias. The effective sample size needed per population for accurate assignment, and the degree to which these values must be standardized, will depend on the population structure of the study system. For example, study systems with higher genetic differentiation between populations may not need to finely standardize effective sample size to achieve high assignment accuracy. We suggest that other population assignment studies similarly evaluate the influence of source population effective sample size on the assignment of known source individuals before assigning individuals of unknown origin. The effective sample size of a sampled population can be reduced by either removing individuals or downsampling the read depth. In this study, we chose to remove individuals, and used the individuals’ effective sample sizes as a guide for how many individuals to remove from each population (resulting in 21 – 27 samples per population). For studies with smaller sample sizes, it may be worthwhile to investigate whether retaining all individuals but downsampling reads is a better alternative for standardizing effective sample sizes, because it retains more variation from individuals.
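The balancing step described above can be sketched in a few lines of Python. This is a minimal illustration and not the procedure used in this study: the function name, the greedy rule of removing the lowest-contributing individuals first, and the example values are all assumptions. It takes per-individual effective sample sizes as given and does not compute them.

```python
from typing import Dict, List


def balance_effective_sizes(
    ess_by_pop: Dict[str, List[float]], tolerance: float = 1.0
) -> Dict[str, List[int]]:
    """Return, for each population, the indices of the individuals to keep.

    ess_by_pop maps a population name to per-individual effective sample
    sizes (hypothetical input, computed elsewhere). Individuals are removed
    until every population total is within `tolerance` of the smallest total.
    """
    totals = {pop: sum(values) for pop, values in ess_by_pop.items()}
    target = min(totals.values())  # the smallest population sets the target
    kept = {}
    for pop, values in ess_by_pop.items():
        # Order indices from highest to lowest contribution, so that
        # order.pop() always removes the lowest-contributing individual.
        order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
        total = totals[pop]
        while len(order) > 1 and total - target > tolerance:
            total -= values[order.pop()]
        kept[pop] = sorted(order)
    return kept


# Hypothetical example: equal numbers of individuals, unequal depth.
example = {
    "pop_low_depth": [0.5] * 10,   # total effective size 5.0
    "pop_high_depth": [0.8] * 10,  # total effective size 8.0
}
kept = balance_effective_sizes(example)
for pop, idx in kept.items():
    print(pop, len(idx), round(sum(example[pop][i] for i in idx), 2))
```

In this toy example the higher-depth population loses three individuals, which brings its total effective sample size to within one effective individual of the other population. A downsampling variant would instead reduce each individual's contribution while keeping all individuals.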
Importantly, here we demonstrate that individuals with very low whole genome coverage (0.01X – 0.1X) can still be accurately assigned, provided that the source populations have sufficient effective sample sizes. These results suggest that increasing the number of samples and decreasing individual sequencing depth is an effective study design strategy for population assignment. For migratory connectivity studies, increased sampling (both the number of individuals at each location and the number of locations sampled) across nonbreeding stages of the annual cycle can drastically improve our understanding of population-level connectivity at low cost. Combined with cost-effective approaches for library preparation, lcWGS is becoming economically feasible for a wide range of studies. However, a trade-off with lcWGS is that sequence data processing incurs additional costs in the time spent on bioinformatic analysis. For studies interested in population assignment with a large number of samples, increasing the number of samples per lane, and thereby decreasing the mean sequencing depth, may make lcWGS economically feasible compared to other sequencing methods. For a comprehensive review of coverage guidelines for different types of analyses with low-coverage WGS data, see Lou et al. (2021).
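The relationship between samples per lane and mean depth can be illustrated with simple arithmetic. The sketch below is an idealized approximation with hypothetical numbers: the lane yield and genome size are placeholders rather than values from this study, and the calculation ignores duplicate reads, unmapped reads, and uneven pooling.

```python
def mean_depth_per_sample(
    lane_yield_gb: float, genome_size_gb: float, n_samples: int
) -> float:
    """Approximate mean depth per sample when a lane is split evenly.

    All inputs are hypothetical planning values; real depth will be lower
    after filtering duplicates and unmapped reads.
    """
    return lane_yield_gb / (genome_size_gb * n_samples)


# Hypothetical lane producing 100 Gb for a 1 Gb genome.
for n in (50, 200, 1000):
    depth = mean_depth_per_sample(100.0, 1.0, n)
    print(f"{n} samples per lane -> about {depth:.2f}X per sample")
```

Under these assumed values, pooling more samples per lane trades individual depth for sample number at a fixed sequencing cost.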
An interesting aspect of our results was that all posterior probabilities of assignment were > 0.8, even for potentially admixed individuals. A standard method to determine assignment confidence in population assignment studies is to use a cutoff value for posterior probabilities of assignment. Individuals with low posterior probabilities of assignment (e.g., < 0.8) can be highly admixed, so it is inaccurate to classify them as originating from a specific population. However, we suspect that with lcWGS data, the high prevalence of loci covered by a single read results in the likelihood being highest for a homozygous genotype. Thus, admixed individuals may “switch” their population of maximum likelihood depending on the loci used for assignment. Our use of an assignment consistency threshold addressed this concern by creating subsets of the genomic data for population assignment to determine whether individuals could be reliably assigned to a single population when different loci were used. Testing the assignment consistency threshold with known source individuals revealed three individuals with inconsistent assignment (< 0.8, i.e., assigned to the same population in fewer than 8 out of 10 genomic datasets) that were likely admixed between pure Northern Temperate and Southern Temperate populations. These results highlight that the consistency of assignment may be more reliable than posterior probabilities for confidently assigning individuals of unknown origin. Further development of spatially explicit assignment methods for genotype likelihood data would be helpful for determining the likely origin of admixed individuals at the periphery of source populations.
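The assignment consistency check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the example assignments are hypothetical, and the sketch takes the population assigned in each genomic subset as given rather than performing the assignment itself.

```python
from collections import Counter
from typing import List, Tuple


def assignment_consistency(assignments: List[str]) -> Tuple[str, float]:
    """Return the most frequently assigned population and the proportion
    of genomic subsets in which the individual was assigned to it."""
    counts = Counter(assignments)
    population, n = counts.most_common(1)[0]
    return population, n / len(assignments)


def is_consistent(assignments: List[str], threshold: float = 0.8) -> bool:
    """An individual is treated as consistently assigned when the
    proportion meets the threshold (0.8, i.e., 8 out of 10 subsets)."""
    _, consistency = assignment_consistency(assignments)
    return consistency >= threshold


# Hypothetical individuals, each assigned using 10 genomic subsets.
stable = ["Northern Temperate"] * 9 + ["Southern Temperate"]
switching = ["Northern Temperate"] * 6 + ["Southern Temperate"] * 4

print(assignment_consistency(stable), is_consistent(stable))
print(assignment_consistency(switching), is_consistent(switching))
```

The second hypothetical individual switches population between subsets and falls below the threshold, so it would be flagged as inconsistently assigned even if each subset returned a high posterior probability.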