Introduction
As next-generation (i.e., massively parallel, short-read) sequencing has become the ubiquitous approach for population genetic and phylogenetic investigations, SNP (single-nucleotide polymorphism) datasets have quickly become the standard input data for a wide array of phylogenetic and population genetic analyses (Toews et al., 2015). SNP datasets typically contain thousands to millions of base calls for each individual sample (i.e., genotypes), located at thousands to millions of variable sites (i.e., SNPs) throughout the genome of the focal taxa. For storing these genotype calls along with associated quality information, the standardized and efficient vcf (variant call format) file has become the widely accepted convention (Danecek et al., 2011). By retaining only variant sites (i.e., called SNPs), large population genomic datasets can be stored in vcf files that are manageable in terms of both file size and the computational time required for downstream analyses.
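For illustration, each data row of a vcf file encodes a single variant site, pairing fixed site-level columns with per-sample genotype and quality fields; the hypothetical record below shows one SNP genotyped in one sample (meta-information header lines omitted):

#CHROM  POS    ID  REF  ALT  QUAL  FILTER  INFO   FORMAT    sample_1
chr1    14653  .   C    T    58    PASS    DP=18  GT:DP:GQ  0/1:18:55

Here the FORMAT column declares that each sample field reports the called genotype (GT), read depth (DP), and genotype quality (GQ), so the single sample shown is a heterozygote (0/1) supported by 18 reads with a genotype quality of 55.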
As SNP datasets stored as vcf files have become standard input for population genetic and phylogenetic analyses, programs to call SNPs from next-generation sequencing data have proliferated rapidly (e.g., GATK (McKenna et al., 2010), SAMtools (Danecek et al., 2021), Stacks (Rochette et al., 2019), dDocent (Puritz et al., 2014), and ipyrad (Eaton & Overcast, 2020)). While these programs are optimized to call SNPs rapidly and accurately, called SNPs will inevitably suffer from a variety of technical issues such as sequencing error, paralogous assembly, and missing data, all of which need to be addressed before performing downstream analyses (O’Leary et al., 2018). Furthermore, individual samples may suffer from low sequencing coverage or contamination, preventing their confident use in downstream analyses and necessitating their removal from the dataset (Cerca et al., 2021). To address these issues, some SNP calling programs contain built-in functionality for filtering output SNP datasets (e.g., GATK and Stacks), but these filtering options often leave much to be desired for investigators hoping to explore parameter space thoroughly and understand the nuances of their particular SNP dataset. This limited functionality results in a gap in many bioinformatic pipelines, especially for SNP datasets generated via reduced-representation genomic sequencing, where de novo assembly and rampant missing data often necessitate careful filtering in order to maximize retained signal while minimizing systematic error (O’Leary et al., 2018).
Currently, this bioinformatic gap is often addressed via informal data visualizations implemented across multiple programs and programming languages, as investigators attempt to confirm that technical issues and missing data are not influencing downstream inferences before choosing a final set of SNP filtering parameters. As reproducibility becomes more widely acknowledged as critical for the future of science, effectively documenting this often iterative process of investigating, visualizing, and cleaning large datasets remains a major challenge. Current state-of-the-art programs for filtering SNP datasets such as GATK and VCFtools (Danecek et al., 2011) are highly efficient and parallelizable, making them especially useful for massive datasets, but their command-line interfaces do not lend themselves easily to graphical visualization, leading many investigators to rely on homebrewed scripts for preliminary data investigation. A common homebrewed approach involves generating summary statistic files using a command-line program or Unix scripting, then investigating and visualizing those summaries using the data visualization tools of the R computing language (R Core Team, 2019). This approach requires moving between scripting languages and writing custom code to perform visualizations, creating a steep learning curve for inexperienced users and resulting in pipelines that may be error prone and difficult to document. The ubiquity of this approach indicates an outstanding need within the field of evolutionary genomics for user-friendly software that automates and streamlines the process of investigating, visualizing, and filtering SNP datasets.
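To make this two-language pattern concrete, the following minimal sketch (in R) assumes VCFtools is installed on the system path and uses a hypothetical input file name (snps.vcf): per-sample missingness is first summarized on the command line, then read into R for visualization.

# call VCFtools from R to summarize per-sample missingness (writes snps.imiss)
system("vcftools --vcf snps.vcf --missing-indv --out snps")
# read the resulting summary table into R
imiss <- read.table("snps.imiss", header = TRUE)
# visualize the proportion of missing genotypes per sample
hist(imiss$F_MISS, xlab = "Proportion of missing genotypes",
     main = "Missingness per sample")

Even in this simple case, the summary statistic, the intermediate file, and the plotting code all live in different places, which illustrates why such pipelines are difficult to document and reproduce.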
The R package SNPfiltR is designed to fill this bioinformatic gap with a suite of custom functions for investigating, visualizing, and filtering reduced-representation SNP datasets within a coherent R-based framework. As input, these functions take SNP datasets stored as standard vcf files, read into an R working environment as ‘vcfR’ objects, a step performed using the function read.vcfR() from the R package vcfR (Knaus & Grünwald, 2017). The functions in SNPfiltR can then generate automated visualizations of key parameters such as depth, quality, and missing data, and allow users to specify appropriate filtering thresholds based on the nuances of their particular dataset, for rapid filtering directly in R. Functions from SNPfiltR can be used in concert with import and export functions from vcfR to generate an interactive, customizable SNP filtering pipeline, all within a single R script. The package is publicly available on CRAN via the syntax install.packages("SNPfiltR"), the development version is available from GitHub (github.com/DevonDeRaad/SNPfiltR), and the entire package is thoroughly documented, with descriptions and examples for each function and multiple comprehensive vignettes, at the website devonderaad.github.io/SNPfiltR/.
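As a minimal sketch of such a single-script pipeline, the code below uses a hypothetical input file name (raw.snps.vcf) and placeholder filtering thresholds; the filtering function names (hard_filter(), missing_by_sample(), missing_by_snp()) follow the SNPfiltR documentation, which should be consulted for the authoritative interface and full set of options.

# install from CRAN and load the required packages
install.packages("SNPfiltR")
library(SNPfiltR)
library(vcfR)
# read the vcf file into R as a vcfR object
vcf <- read.vcfR("raw.snps.vcf")
# visualize depth and genotype quality, and apply hard filters to genotypes
vcf <- hard_filter(vcf, depth = 5, gq = 30)
# visualize per-sample missingness to identify low-quality samples
missing_by_sample(vcfR = vcf)
# visualize per-SNP completeness and retain SNPs genotyped in >= 80% of samples
vcf <- missing_by_snp(vcf, cutoff = 0.8)
# export the filtered dataset as a gzipped vcf file using vcfR
vcfR::write.vcf(vcf, file = "filtered.snps.vcf.gz")

Because each step both visualizes the relevant parameter and returns a filtered vcfR object, the entire investigate-visualize-filter cycle is recorded in one reproducible R script.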