Characterization of neutral variation and SNPs under selection
Before analyzing genetic structure, we used PLINK v1.9 (Purcell et al. 2007) to prune SNPs according to their linkage disequilibrium, estimated as the correlation coefficient between pairs of SNPs. This filtering step was needed because the subsequent structure analysis does not account for linkage disequilibrium, so linked SNPs could bias the grouping of individuals. Genetic structure analyses were performed with ADMIXTURE v1.3 (Alexander et al. 2009), which estimates the ancestry proportions of each individual by maximum likelihood. To determine the number of groups for which the genetic structure model had the greatest predictive power, we used cross-validation errors (cv-errors), with lower values indicating better predictive power.
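The idea behind correlation-based pruning can be sketched in a few lines of Python. This is a simplified greedy illustration, not PLINK's windowed algorithm. The r² threshold of 0.2, the function names and the toy genotypes are all assumptions made for the example, since the text does not report the pruning parameters.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 allele counts)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated as unlinked
    return cov * cov / (var_x * var_y)


def ld_prune(genotypes, r2_threshold=0.2):
    """Greedy pruning: keep a SNP only if its r^2 with every kept SNP is <= threshold.

    r2_threshold is a hypothetical value chosen for illustration.
    """
    kept = []
    for idx, snp in enumerate(genotypes):
        if all(r_squared(snp, genotypes[k]) <= r2_threshold for k in kept):
            kept.append(idx)
    return kept


# Toy data (invented): rows are SNPs, columns are individuals.
toy_genotypes = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],  # identical to SNP 0, so r^2 = 1 and it is pruned
    [1, 2, 0, 1, 0, 2],  # uncorrelated with SNP 0, so it is kept
]
kept_snps = ld_prune(toy_genotypes)
print(kept_snps)
```

The sketch ignores missing genotypes and compares every SNP with all previously kept SNPs, which is only practical for small datasets. Windowed pruning avoids that cost by comparing only nearby SNPs.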
We applied an additional filtering step before the outlier analysis to minimize the false positive rate. We discarded loci whose minor allele frequency was < 0.05 in any population (thus excluding private alleles) and loci that were not successfully sequenced in at least 75% of the individuals in each population. The resulting dataset consisted of 6,421 SNPs. To identify SNPs putatively under selection, we performed an outlier analysis with BayeScan v2.1 (Foll and Gaggiotti 2008), a conservative method with a low false positive rate that is particularly useful when the number of populations is low. This program uses a logistic regression to decompose the FST coefficients into a population-specific effect (β) and a locus-specific effect (α). We selected loci with α > 0, which suggests positive selection, and a q-value < 0.05, that is, a false discovery rate below 5% after correcting for multiple testing.
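The two filtering criteria and the outlier selection rule can be expressed as a short Python sketch. The data layout (allele counts 0/1/2 with None for missing genotypes), the function names and the record fields `name`, `alpha` and `qval` are assumptions made for illustration. They do not correspond to the BayeScan file format or to the pipeline actually used.

```python
def passes_filters(genotypes_by_pop, min_maf=0.05, min_call_rate=0.75):
    """Return True if a locus meets both thresholds in every population.

    genotypes_by_pop maps a population label to a list of genotypes coded as
    0/1/2 copies of one allele, with None for a missing genotype
    (hypothetical layout).
    """
    for genotypes in genotypes_by_pop.values():
        if not genotypes:
            return False
        called = [g for g in genotypes if g is not None]
        if len(called) / len(genotypes) < min_call_rate:
            return False  # genotyped in too few individuals
        allele_freq = sum(called) / (2 * len(called))
        maf = min(allele_freq, 1 - allele_freq)
        if maf < min_maf:
            return False  # rare or absent in this population
    return True


def select_outliers(loci, max_q=0.05):
    """Names of loci with a positive locus effect (alpha) and q-value below max_q."""
    return [loc["name"] for loc in loci if loc["alpha"] > 0 and loc["qval"] < max_q]


# Toy loci (invented values).
locus_ok = {"pop1": [0, 1, 2, 1], "pop2": [1, 1, 0, 2]}
locus_private = {"pop1": [0, 0, 0, 0], "pop2": [1, 1, 2, 2]}    # allele absent in pop1
locus_missing = {"pop1": [0, 1, None, None], "pop2": [1, 1, 0, 2]}  # 50% call rate

results = [
    {"name": "snp1", "alpha": 1.2, "qval": 0.01},
    {"name": "snp2", "alpha": -0.8, "qval": 0.01},  # negative alpha: not selected
    {"name": "snp3", "alpha": 0.9, "qval": 0.20},   # q-value too high
]
outliers = select_outliers(results)
print(outliers)
```

Because the frequency threshold is applied within each population, a locus that is fixed in one population is removed even if it is polymorphic in the other, which is how private alleles are excluded.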
We then attempted to annotate the SNPs with the highest values of α by running a BLASTn search against the entire NCBI database. In addition, we performed a chi-square (χ²) test to determine whether the genotype frequencies at any of these loci deviated significantly from those expected under Hardy-Weinberg (H-W) conditions. We did this because the environmental differences between the two populations described above could produce not only divergent selective pressures on some SNPs, but also a selective pressure that acts in one population and is absent from the other. In the latter case, the deviation from H-W expectations should occur only in the population under selection.
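A chi-square test of Hardy-Weinberg proportions for one biallelic locus in one population might look like the following sketch. The function name and the genotype counts are invented for illustration. The p-value uses one degree of freedom, the usual choice for a biallelic locus, and the text does not state whether any continuity correction was applied, so none is used here.

```python
from math import erfc, sqrt


def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions for one biallelic locus.

    Takes the observed counts of the three genotypes and returns
    (chi2, p_value), with the p-value based on 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotyped individuals")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 df: erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Toy counts (invented): one population in H-W proportions, one with no heterozygotes.
chi2_eq, p_eq = hwe_chi2(25, 50, 25)
chi2_dev, p_dev = hwe_chi2(50, 0, 50)
print(chi2_eq, p_eq)
print(chi2_dev, p_dev)
```

Running the test separately for each population makes it possible to see whether a deviation is restricted to one of them, which is the pattern the paragraph above associates with a selective pressure present in only one population.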