De novo assembly and data filtering (dd-RADseq samples)
Raw reads were first demultiplexed and quality filtered through theprocess_radtags.pl pipeline in stacks v.2.5 (Rochette,
Rivera-Colón, & Catchen, 2019). In the absence of a reference genome
for any of the three species, RAD-seq loci were de novo assembled
independently in each species under the denovo_map.pl pipeline
in stacks. We used the following assembly parameters:m =3 (minimum read depth to create a stack), M =4 (number of
mismatches allowed between loci within individuals), and n=4(number of mismatches allowed between loci within catalogue). We found
an average coverage per species of ~10x (see results). A
consensus on the threshold below which SNP calling may be considered
unreliable is still lacking. However, genotype free estimation of allele
frequency is generally recommended with low to medium coverage
(Korneliussen, Albrechtsen, & Nielsen, 2014). This approach,
implemented in the software angsd v.0.923 (Korneliussen et al.,
2014), has been rarely applied to Rad-seq data (however, see Warmuth and
Ellegren (2019) for an exception) and, to our knowledge, never to
Rad-seq data from non-model organisms, probably due to need of a
reference sequence for the software to work. Here, we followed the
approach of Khimoun et al., (2020) by creating an artificial reference
sequence. First, we used the population script in stacks
to assembly loci present in at least 80% of the individuals (using the
flag r=0.8 ); then, we concatenated the consensus sequences of the
retrieved loci spaced by a stretch of 120 N (unknown) characters (the
same length of the Rad-loci) to facilitate the subsequent mapping. Raw
reads were then mapped back to the novel reference sequence by means of
the bwa-mem algorithm with default parameters (Li & Durbin,
2009). Using custom bash scripts coupled with angsd, we applied
a number of filters to the aligned data and eliminated: i) sites
with coverage <3 (-minIndDepth =3 flag),ii) bad quality bases and poorly aligning reads (-minQ and
-minMapQ and -C flags with default values); iii)poor quality sites based on the per base alignment quality
(-baq=1 flag); : iv) SNPs in the last 5 bp of each locus;v) SNPs heterozygote in at least 80% of individuals; vi)loci with more than 5 SNPs that could potentially be paralogous;vii ) sites with missing data by setting the -minInd flag
to the total number of individuals retained in each species. The
filtered dataset was then used to generate a site allele frequency
likelihood file, with the genotype likelihoods computed with the
SAMtools method (-GL=1 flag), further optimised to compute a
folded site frequency spectrum (SFS) with no missing data for
downstream analyses. An alternative (and simpler) approach would have
been to augment m to achieve an higher coverage (Paris, Stevens,
& Catchen, 2017). However, beside the considerable loss in the number
of assembled loci (and hence of retrieved SNPs), we found by extensive
simulation of in silico Rad experiments that selecting high
coverage loci biases the SFS towards low frequency variants (Mona,
Bertorelle, Benazzo, in preparation). The SFS for C. melanopteruswas estimated directly from the high coverage exon-capture dataset of
Maisano Delser et al. (2019).