Missing data filtering for a scrub-jay RADseq SNP dataset
I then used the dedicated visualization tools offered by SNPfiltR to investigate patterns of missing data by individual
sample and by SNP for this quality-filtered scrub-jay SNP dataset (Fig.
3). The function missing_by_sample() reveals that missing data
is distributed relatively equally across a priori identified species
groups, and that with all 115 samples included, there are hardly any
SNPs that reach a 90% completeness threshold. A visualization of the
proportion of missing genotype calls in each sample shows that samples
vary along a relatively continuous distribution from missing less than
20% of genotype calls to missing nearly 100% of genotype calls. Using
the missing_by_sample() function, I applied a maximum per-sample
proportion of missing genotypes of 81%, which removed 20 samples
from the dataset (Fig. 3). Because SNPs may have become
invariant if all minor allele genotypes were removed when these samples
were dropped, I again implemented a minor allele count filter, with a
minimum of one minor allele genotype per SNP, to remove invariant sites,
resulting in 0.61% of remaining SNPs being dropped.
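As a rough illustration of the logic behind these two steps, the following Python sketch drops samples above a missing-data cutoff and then removes SNPs left invariant. It is not SNPfiltR code (SNPfiltR is an R package); the function names, the toy genotype matrix, the 0/1/2 genotype coding, and the inclusive cutoff are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Assumed coding: 0, 1, or 2 copies of the alternate allele; None = missing call.
Genotype = Optional[int]
Matrix = List[List[Genotype]]  # rows are SNPs, columns are samples


def missing_per_sample(genos: Matrix) -> List[float]:
    """Proportion of missing genotype calls for each sample (column)."""
    n_snps = len(genos)
    n_samples = len(genos[0])
    return [
        sum(row[j] is None for row in genos) / n_snps
        for j in range(n_samples)
    ]


def drop_samples(genos: Matrix, cutoff: float) -> Tuple[Matrix, List[int]]:
    """Keep samples whose missing proportion is at most `cutoff` (assumed inclusive)."""
    missing = missing_per_sample(genos)
    keep = [j for j, m in enumerate(missing) if m <= cutoff]
    return [[row[j] for j in keep] for row in genos], keep


def minor_allele_count(row: List[Genotype]) -> int:
    """Count of the rarer allele among the called genotypes at one SNP."""
    called = [g for g in row if g is not None]
    alt = sum(called)
    ref = 2 * len(called) - alt
    return min(alt, ref)


def drop_invariant(genos: Matrix, min_mac: int = 1) -> Matrix:
    """Remove SNPs whose minor allele count is below `min_mac`."""
    return [row for row in genos if minor_allele_count(row) >= min_mac]


# Toy data: the third sample is mostly missing and carries the only
# alternate allele at the second SNP.
genos = [
    [0, 1, None, 2],
    [0, 0, 1, 0],
    [1, None, None, 0],
    [0, 2, None, 1],
]

filtered, kept_samples = drop_samples(genos, cutoff=0.5)
variant_only = drop_invariant(filtered, min_mac=1)
```

In this toy case, removing the high-missingness sample leaves the second SNP with no minor allele, so it is dropped by the second pass. This is the situation the repeated minor allele count filter guards against.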
I then used the SNPfiltR function missing_by_snp() to
visualize the proportion of missing data in each sample across a
reasonable set of potential per-SNP completeness thresholds (Fig. 3).
This visualization shows a continuous distribution of missing data
within retained samples and no visible outlier samples, indicating that
problematic samples were successfully dropped from the dataset.
Dotplots show a strong negative correlation between total proportion
missing data and the total number of SNPs retained in the dataset,
across potential per-SNP filtering thresholds. I chose to implement a
per-SNP completeness cutoff of 85% using the function missing_by_snp(), resulting in a final quality- and missing-data-filtered
SNP dataset containing 95 samples, 16,307 SNPs, and 5.7%
total missing genotypes (Fig. 3).
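The per-SNP completeness filter can be sketched in the same way. The Python below is an illustrative assumption rather than the SNPfiltR implementation: it keeps SNPs called in at least a given proportion of samples and reports, for each candidate cutoff, how many SNPs remain and how much missing data is left. All names and the toy matrix are hypothetical.

```python
from typing import List, Optional, Tuple

Genotype = Optional[int]  # None = missing call
Matrix = List[List[Genotype]]  # rows are SNPs, columns are samples


def snp_completeness(row: List[Genotype]) -> float:
    """Proportion of samples with a called genotype at one SNP."""
    return sum(g is not None for g in row) / len(row)


def filter_by_completeness(genos: Matrix, cutoff: float) -> Matrix:
    """Keep SNPs whose completeness is at least `cutoff` (assumed inclusive)."""
    return [row for row in genos if snp_completeness(row) >= cutoff]


def total_missing(genos: Matrix) -> float:
    """Overall proportion of missing genotype calls in the matrix."""
    n_calls = sum(len(row) for row in genos)
    n_missing = sum(g is None for row in genos for g in row)
    return n_missing / n_calls if n_calls else 0.0


def threshold_summary(
    genos: Matrix, cutoffs: List[float]
) -> List[Tuple[float, int, float]]:
    """For each cutoff: (cutoff, SNPs retained, total missing proportion)."""
    summary = []
    for cutoff in cutoffs:
        kept = filter_by_completeness(genos, cutoff)
        summary.append((cutoff, len(kept), total_missing(kept)))
    return summary


# Toy data: four SNPs with completeness 1.0, 0.75, 0.5, and 0.25.
toy = [
    [0, 1, 2, 0],
    [0, None, 1, 0],
    [None, None, 1, 0],
    [None, None, None, 1],
]

summary = threshold_summary(toy, [0.5, 0.75, 1.0])
```

Tabulating the retained SNP count and the remaining missing data across several cutoffs in this way gives the numbers needed to weigh dataset size against completeness when choosing a threshold.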
To ensure that the implemented 85% per-SNP completeness threshold effectively
prevents patterns of missing data within individuals from driving
overall clustering patterns, I then used the function assess_missing_data_pca() to visualize sample clustering
at the 75% and 85% per-SNP completeness thresholds (Fig.
4). At both thresholds, all samples visually cluster according to a
priori assignment to species groups. When samples are colored according
to proportion missing data, it becomes evident that within species
groups, samples with the most missing data are clustered the least
tightly, indicating increased uncertainty in assignment. Between the
75% and 85% per-SNP completeness thresholds, the more restrictive
threshold slightly reduces the effect of missing data in these most
loosely assigned samples (Fig. 4). Sample clustering using t-SNE reveals
additional population substructure within species groups and shows no
indication that missing data is driving patterns of clustering either
between or within groups (Fig. 4). A final filter for physical linkage,
using the SNPfiltR function distance_thin() to thin the dataset so that no
two retained SNPs are separated by less than 500 base pairs, resulted in a quality- and
missing-data-filtered, unlinked dataset of 2,803 SNPs ready for
input into downstream analyses.
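One simple way to perform this kind of distance-based thinning is a greedy pass along each locus, sketched below in Python. This is an assumed approach for illustration and not necessarily how distance_thin() is implemented; the function name thin_by_distance, the (locus, position) representation, and the toy coordinates are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Snp = Tuple[str, int]  # (locus or chromosome name, base-pair position)


def thin_by_distance(snps: List[Snp], min_distance: int) -> List[Snp]:
    """Greedily keep SNPs so that retained SNPs on the same locus
    are at least `min_distance` base pairs apart."""
    by_locus: Dict[str, List[int]] = defaultdict(list)
    for locus, pos in snps:
        by_locus[locus].append(pos)

    kept: List[Snp] = []
    for locus in sorted(by_locus):
        last_kept = None
        for pos in sorted(by_locus[locus]):
            # Always keep the first SNP on a locus; afterwards keep a SNP
            # only if it is far enough from the previously kept one.
            if last_kept is None or pos - last_kept >= min_distance:
                kept.append((locus, pos))
                last_kept = pos
    return kept


# Toy coordinates on two hypothetical loci.
snps = [
    ("locus1", 10), ("locus1", 120), ("locus1", 600),
    ("locus2", 5), ("locus2", 400), ("locus2", 505),
]

thinned = thin_by_distance(snps, min_distance=500)
```

With a 500 base-pair minimum, the SNPs at positions 120 and 400 are dropped because each lies within 500 base pairs of the previously retained SNP on its locus. Note that a greedy pass depends on the starting SNP, so a different starting point could retain a different, equally valid subset.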