Novel functions for visualizing and filtering SNP datasets in R
The SNPfiltR package relies on the import and export
functions of the vcfR package to efficiently read vcf files into
the local memory of an R working environment as vcfR objects, and to
write vcfR objects to disk as gzipped vcf files. Once a vcf file has
been read into the local R working environment as a vcfR object, it is
immediately available in proper input format for all SNPfiltR functions. Each SNPfiltR function can be run without specified
thresholds or cutoffs (e.g., hard_filter(vcfR=vcfR.object)) to
visualize the parameter space that will be filtered, without performing
filtering, allowing users to quickly make informed decisions based on
patterns specific to their datasets, and implement their chosen
filtering thresholds (e.g., hard_filter(vcfR=vcfR.object,
depth=5, gq=30)). SNPfiltR contains a suite of commonly
implemented filters for genomic datasets, including filtering based on
genotype quality, minimum and maximum read depth, allele balance, number
of alleles present, missing data per sample, missing data per SNP, minor
allele count, and physical linkage. While most of these filters can be
implemented in other programs (e.g., VCFtools and GATK), SNPfiltR is the first program offering dedicated functions for a
comprehensive suite of SNP visualization and filtering options. Each SNP
filtering function can be implemented or skipped at the discretion of
the user, to build an interactive SNP filtering pipeline customized to
the specific needs of a given genomic dataset.
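
The two-step pattern described above (inspect the parameter space first, then filter) can be illustrated with a minimal base R sketch on made-up data. This is not SNPfiltR's implementation: toy_hard_filter() and its gq_min argument are hypothetical names introduced only for this illustration, and the depth and quality matrices are simulated rather than extracted from a vcfR object. The sketch assumes that genotypes failing a threshold are set to missing rather than removing whole SNPs.

```r
# Toy data, entirely made up: rows are SNPs, columns are samples.
set.seed(1)
n_snps <- 200
n_samples <- 10
n <- n_snps * n_samples
gt <- matrix(sample(c("0/0", "0/1", "1/1"), n, replace = TRUE), nrow = n_snps)
dp <- matrix(rpois(n, lambda = 8), nrow = n_snps)             # read depth per genotype
gq <- matrix(sample(0:99, n, replace = TRUE), nrow = n_snps)  # genotype quality

# Hypothetical helper: with no thresholds it only summarizes the data;
# with thresholds it sets failing genotypes to NA.
toy_hard_filter <- function(gt, dp, gq, depth = NULL, gq_min = NULL) {
  if (is.null(depth) && is.null(gq_min)) {
    cat("Depth distribution:\n")
    print(summary(as.vector(dp)))
    cat("Genotype quality distribution:\n")
    print(summary(as.vector(gq)))
    return(invisible(gt))  # input returned unchanged
  }
  fail <- matrix(FALSE, nrow = nrow(gt), ncol = ncol(gt))
  if (!is.null(depth)) fail <- fail | (dp < depth)
  if (!is.null(gq_min)) fail <- fail | (gq < gq_min)
  gt[fail] <- NA
  cat(sprintf("%.1f%% of genotypes set to missing\n", 100 * mean(fail)))
  gt
}

# Step 1: look at the parameter space without filtering.
unfiltered <- toy_hard_filter(gt, dp, gq)

# Step 2: apply the chosen thresholds (same values as the example in the text).
filtered <- toy_hard_filter(gt, dp, gq, depth = 5, gq_min = 30)
```

With the package itself, the corresponding calls are the ones given in the text, hard_filter(vcfR=vcfR.object) followed by hard_filter(vcfR=vcfR.object, depth=5, gq=30).
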
Beyond simply filtering, I also developed functions to automate the
process of investigating the effects of missing data on a SNP dataset.
The SNPfiltR functions assess_missing_data_pca() and assess_missing_data_tsne() are designed to perform
dimensionality reduction on highly multi-dimensional SNP datasets, using
principal components analysis (PCA) via the R package adegenet (Jombart,
2008) and t-distributed stochastic neighbor embedding implemented via
the R package Rtsne (Krijthe & van der Maaten, 2015),
respectively. Each of these functions then visualizes the similarity
between input samples in two-dimensional space, across user-specified
per-SNP missing data thresholds. Users also have the option to perform
unsupervised clustering to assign samples to groups without a priori
information using Partitioning Around Medoids (PAM) implemented
internally via the R package cluster (Maechler et al., 2018), by
setting clustering = TRUE, if they wish to assess the effect of missing
data on objective sample clustering assignments. Finally, each of these
functions will generate an additional visualization of sample similarity
in two-dimensional space with samples color-coded by missing data
proportion, allowing the user to visually assess whether missing data is
driving patterns of sample clustering. These investigative functions can
be used in tandem with the functions missing_by_snp() and missing_by_sample() to ensure that user-specified
missing data thresholds, both per sample and per SNP, are sufficient
to keep missing data from driving patterns of sample
clustering in a given dataset before performing downstream
population genetic or phylogenetic analyses.
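
The logic of assessing missing data across per-SNP thresholds can be sketched in base R as follows. This is an illustration under stated assumptions, not the code behind assess_missing_data_pca(): the genotype matrix is simulated, prcomp() stands in for the adegenet PCA, missing genotypes are mean-imputed for simplicity, and toy_assess_missing_pca() is a hypothetical helper name. A threshold here is the minimum proportion of samples with a called genotype that a SNP needs in order to be retained.

```r
set.seed(42)
n_per_pop <- 15
n_snps <- 300

# Two toy populations with independent allele frequencies at every SNP.
freq_a <- runif(n_snps, 0.1, 0.9)
freq_b <- runif(n_snps, 0.1, 0.9)
pop_a <- matrix(rbinom(n_per_pop * n_snps, 2, rep(freq_a, each = n_per_pop)),
                nrow = n_per_pop)
pop_b <- matrix(rbinom(n_per_pop * n_snps, 2, rep(freq_b, each = n_per_pop)),
                nrow = n_per_pop)
geno <- rbind(pop_a, pop_b)  # rows are samples, columns are SNPs

# Give each sample its own missing data rate, then mask genotypes at random.
miss_rate <- runif(nrow(geno), 0.02, 0.5)
mask <- matrix(runif(length(geno)) < miss_rate, nrow = nrow(geno))
geno[mask] <- NA

# Hypothetical helper: for each completeness threshold, keep SNPs that pass,
# run a PCA, and report how strongly PC1 tracks per-sample missingness.
toy_assess_missing_pca <- function(geno, thresholds) {
  sample_missing <- rowMeans(is.na(geno))
  rows <- lapply(thresholds, function(th) {
    keep <- colMeans(!is.na(geno)) >= th
    if (sum(keep) < 2) stop("too few SNPs retained at threshold ", th)
    sub <- geno[, keep, drop = FALSE]
    # Mean-impute remaining missing genotypes (a simplifying assumption).
    col_means <- colMeans(sub, na.rm = TRUE)
    na_idx <- which(is.na(sub), arr.ind = TRUE)
    sub[na_idx] <- col_means[na_idx[, 2]]
    pca <- prcomp(sub, center = TRUE, scale. = FALSE)
    data.frame(threshold = th,
               n_snps = ncol(sub),
               cor_pc1_missing = abs(cor(pca$x[, 1], sample_missing)))
  })
  do.call(rbind, rows)
}

res <- toy_assess_missing_pca(geno, thresholds = c(0.6, 0.7, 0.8))
print(res)
```

A large correlation between PC1 and per-sample missingness at a given threshold would suggest that missing data, rather than population structure, is shaping the ordination. Plotting the first two columns of pca$x colored by sample_missing gives the visual counterpart of this check.
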