Discussion
Historically, programs for computationally intensive bioinformatic processes have rarely been implemented in the R language, because the requirement that datasets be read into local memory can create computational bottlenecks with large input files. Here I showed that the R package SNPfiltR can be used to filter moderately sized reduced-representation SNP datasets with runtimes comparable to those of state-of-the-art programs implemented in highly efficient languages such as Perl and C++. While benchmarking confirmed that reading large files into the local memory of an R working environment scales poorly with increasing input file size, the vcfR and SNPfiltR packages can be used in tandem to read and quality filter a SNP dataset containing 50M genotypes and associated quality information in less than two minutes on a personal laptop. A SNP dataset of this size (50M genotypes, e.g., 500K SNPs for 100 samples) is realistic for a set of unfiltered SNP calls resulting from a moderate to large reduced-representation genomic sequencing project, indicating that the computational power of the R language has been generally overlooked for processing and filtering reduced-representation genomic SNP datasets. SNPfiltR takes advantage of this previously overlooked computational power and, unlike existing programs designed for SNP filtering, harnesses the widely commended data visualization capabilities of R, allowing users to design interactive and customizable SNP filtering pipelines within a single R script.
While many existing R packages are capable of working with SNP data, no existing R package contains functions for automated visualization and filtering of SNP data comparable to those offered by SNPfiltR. A few packages focus on directly reading and manipulating SNP data (e.g., vcfR (Knaus & Grünwald, 2017) and dartR (Gruber et al., 2018)), but these largely require custom scripting in R if users wish to filter and visualize their SNP datasets, leaving a need for automated SNP visualization and filtering functions. SNPfiltR is complementary to these packages, extending their functionality with modular functions that automate key visualization and filtering steps, allowing the rapid generation of full SNP filtering pipelines in R. Notably, functions from the SNPfiltR package take vcfR objects as input, which can be read directly from vcf files using the function read.vcfR() from the vcfR package. For this reason, I strongly recommend that users of the SNPfiltR package also cite the vcfR package as part of their integrative SNP filtering pipelines. A suite of additional R packages exists for performing downstream phylogenetic and population genetic analyses on high-quality SNP datasets (e.g., APE (Paradis & Schliep, 2019), stAMPP (Pembleton et al., 2013), SNPrelate (Zheng et al., 2012), adegenet (Jombart, 2008), sambaR (de Jong et al., 2021), and introgress (Gompert & Buerkle, 2010)). SNPfiltR is complementary to these packages as well, as each SNPfiltR function returns a filtered vcfR object, which can be easily converted into a myriad of object classes within R for further analysis using any of these dedicated population genetic programs.
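The workflow described above, in which a vcfR object is passed through successive SNPfiltR functions that each return a filtered vcfR object, can be sketched as follows. This is a minimal illustration rather than a recommended pipeline: the thresholds (depth = 5, cutoff = 0.6) are arbitrary assumptions chosen for demonstration, the file name in the comment is hypothetical, and the example assumes that the vcfR and SNPfiltR packages are installed and that the example dataset vcfR.example shipped with SNPfiltR is available.

```r
# Minimal sketch of an R-only SNP filtering pipeline.
# All thresholds below are illustrative assumptions, not recommendations.
library(vcfR)
library(SNPfiltR)

# With real data, a vcf file would be read into a vcfR object, e.g.:
#   vcf <- read.vcfR("my_snps.vcf.gz")   # hypothetical file name
# Here the small example vcfR object distributed with SNPfiltR is used instead.
vcf <- SNPfiltR::vcfR.example

# Set genotypes with read depth below 5 to missing (assumed threshold).
vcf_filtered <- hard_filter(vcf, depth = 5)

# Retain only biallelic SNPs.
vcf_filtered <- filter_biallelic(vcf_filtered)

# Retain only SNPs genotyped in at least 60% of samples (assumed threshold).
vcf_filtered <- missing_by_snp(vcf_filtered, cutoff = 0.6)

# The result is still a vcfR object and can be passed to further
# filtering functions or converted for downstream analysis.
vcf_filtered
```

Because every step returns a vcfR object, the steps can be reordered or repeated with different thresholds within the same script, and the final object can be handed to conversion functions such as vcfR2genlight() from the vcfR package (which additionally requires adegenet) for downstream analysis.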
It is widely accepted that the universe of elegant, open-source, R-based tools such as RStudio and R Markdown allows for exceptional interactivity and reproducibility (Gandrud, 2018). Additionally, the performance benchmarking results presented here indicate that the computational power of the R programming language is sufficient for analyzing most reduced-representation SNP datasets, even though this practice seems relatively rare. The SNPfiltR package takes advantage of this previously unrecognized opportunity and provides custom functions designed to fully integrate the investigation, visualization, and filtering of a SNP dataset into a single coherent R framework. The filtering functions offered by SNPfiltR perform competitively with current state-of-the-art SNP filtering programs on moderately sized datasets, indicating that bioinformaticians ought to consider implementing fully R-based pipelines to streamline the often complicated and iterative process of optimizing filtering parameters for next-generation sequencing datasets. By extending the bioinformatic tools currently available in R for filtering SNP datasets, the SNPfiltR package will allow users to spend less time investigating and testing filtering parameters, and more time resolving evolutionary mysteries with genomic data.