Genetic Variation
Alleles for 10,709,466 biallelic single nucleotide polymorphisms (SNPs)
scored across 2,029 Arabidopsis genotypes were retrieved from publicly
available data (Arouisse et al., 2020). The genotypes used are inbred
lines made homozygous through selfing and single-seed descent, so
allelic states can be coded 0 (homozygous for the reference allele) or 1
(homozygous for the alternative allele) with no heterozygotes. We
filtered the SNP data to remove SNPs with a missing call rate above
0.05 and rare variants with a minor allele frequency below 0.01. SNPs
were then pruned for linkage disequilibrium using a window size of
500 kb, a step size of 100 variants and a pairwise r² threshold of 0.1,
retaining 86,760 SNPs. All filtering and pruning were conducted in PLINK
v1.90b6.10 (Purcell et al., 2007).
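The two filtering criteria can be illustrated with a minimal Python sketch. It is not the PLINK implementation used in the study, and it omits the linkage pruning step. The function name `filter_snps`, the use of `None` for a missing call, and the toy data are assumptions made for illustration only.

```python
def filter_snps(genotypes, max_missing=0.05, min_maf=0.01):
    """Return indices of SNPs that pass the missingness and MAF filters.

    genotypes: list of SNP rows; each row holds one call per inbred line,
    coded 0 (reference), 1 (alternative) or None (missing call).
    """
    kept = []
    for idx, calls in enumerate(genotypes):
        observed = [c for c in calls if c is not None]
        if not calls or not observed:
            continue  # nothing to score for this SNP
        # Proportion of lines without a call at this SNP.
        missing_rate = (len(calls) - len(observed)) / len(calls)
        if missing_rate > max_missing:
            continue
        # With 0/1 coding and no heterozygotes, the alternative allele
        # frequency is simply the mean of the observed calls.
        alt_freq = sum(observed) / len(observed)
        maf = min(alt_freq, 1 - alt_freq)
        if maf < min_maf:
            continue
        kept.append(idx)
    return kept


# Toy data: 4 SNPs scored across 100 hypothetical lines.
toy = [
    [0] * 50 + [1] * 50,            # common variant, no missing calls
    [0] * 100,                      # monomorphic, MAF = 0
    [None] * 10 + [0] * 45 + [1] * 45,  # 10% missing calls
    [0] * 98 + [1] * 2,             # MAF = 0.02
]
passing = filter_snps(toy)
print(passing)
```

In this toy example the monomorphic SNP fails the frequency filter and the SNP with 10% missing calls fails the missingness filter, so only the first and last SNPs are retained.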
Pruned SNPs were used to compute a genetic similarity matrix (GSM; Speed
& Balding, 2015). The GSM is a square matrix with entries that measure
pairwise similarity between individual genotypes. We compared several
methods of constructing GSMs and found that the choice of method did not
affect model performance, and that including a GSM rendered individual
markers redundant as predictors (Appendix S2). Because a precomputed GSM
is computationally more practical than including numerous SNPs in each
model run, we quantified genetic variation solely through an
identity-by-state GSM. Identity-by-state was preferred because it can be
computed for any pair of individuals, including novel ones.
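As a rough illustration of why identity-by-state extends to novel individuals, the sketch below computes a simple-matching similarity for 0/1-coded inbred lines. The exact formula used in the study is not given in this section, so the definition here (the proportion of jointly called SNPs at which two lines carry the same allele) is an assumption, as are the function names and the toy genotypes.

```python
def ibs_similarity(a, b):
    """Proportion of jointly called SNPs at which lines a and b match.

    a, b: sequences of 0, 1 or None (missing call), one entry per SNP.
    """
    shared = 0
    compared = 0
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # skip SNPs missing in either line
        compared += 1
        if x == y:
            shared += 1
    return shared / compared if compared else float("nan")


def ibs_matrix(lines):
    """Square, symmetric GSM with one row and one column per line."""
    n = len(lines)
    gsm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            gsm[i][j] = gsm[j][i] = ibs_similarity(lines[i], lines[j])
    return gsm


# Three hypothetical lines scored at four pruned SNPs.
lines = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 1],
]
gsm = ibs_matrix(lines)

# A novel line needs only its own SNP calls to be compared with every
# existing line; no population-level quantities have to be re-estimated.
novel = [0, 1, 1, 1]
novel_row = [ibs_similarity(novel, line) for line in lines]
print(gsm)
print(novel_row)
```

Because each entry depends only on the two genotypes being compared, adding a novel line appends one row and one column without changing any existing entry of the matrix.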