De novo assembly and data filtering (dd-RADseq samples)
Raw reads were first demultiplexed and quality filtered through the process_radtags.pl pipeline in stacks v.2.5 (Rochette, Rivera-Colón, & Catchen, 2019). In the absence of a reference genome for any of the three species, RAD-seq loci were de novo assembled independently in each species under the denovo_map.pl pipeline in stacks. We used the following assembly parameters:m =3 (minimum read depth to create a stack), M=4 (number of mismatches allowed between loci within individuals), and n=4(number of mismatches allowed between loci within catalogue). We found an average coverage per species of ~10x (see results). A consensus on the threshold below which SNP calling may be considered unreliable is still lacking. However, genotype free estimation of allele frequency is generally recommended with low to medium coverage (Korneliussen, Albrechtsen, & Nielsen, 2014). This approach, implemented in the software angsd v.0.923 (Korneliussen et al., 2014), has been rarely applied to Rad-seq data (however, see Warmuth and Ellegren (2019) for an exception) and, to our knowledge, never to Rad-seq data from non-model organisms, probably due to need of a reference sequence for the software to work. Here, we followed the approach of Heller et al. (2021) and Khimoun et al. (2020) by creating an artificial reference sequence. First, we used the population script in stacks to assemble loci present in at least 80% of the individuals (using the flag r=0.8 ); then, we concatenated the consensus sequences of the retrieved loci spaced by a stretch of 120 N (unknown) characters (the same length of the Rad-loci) to facilitate the subsequent mapping. Raw reads were then mapped back to the novel reference sequence by means of the bwa-mem algorithm with default parameters (Li & Durbin, 2009). Using custom bash scripts coupled with angsd, we applied a number of filters to the aligned data and eliminated: i) sites with coverage <3 (-minIndDepth =3 flag),ii) bad quality bases and poorly aligning reads (-minQ and -minMapQ and -C flags with default values); iii)poor quality sites based on the per base alignment quality (-baq=1 flag); : iv) SNPs in the last 5 bp of each locus;v) SNPs heterozygote in at least 80% of individuals; vi)loci with more than 5 SNPs that could potentially be paralogous;vii ) sites with missing data by setting the -minInd flag to the total number of individuals retained in each species. The filtered dataset was then used to generate a site allele frequency likelihood file, with the genotype likelihoods computed with the SAMtools method (-GL=1 flag), further optimised to compute a folded site frequency spectrum (SFS) with no missing data for downstream analyses. An alternative (and simpler) approach would have been to augment m to achieve an higher coverage (Paris, Stevens, & Catchen, 2017). However, beside the considerable loss in the number of assembled loci (and hence of retrieved SNPs), we found by extensive simulation of in silico Rad experiments that selecting high coverage loci biases the SFS towards low frequency variants (Mona, Bertorelle, Benazzo, in preparation). The SFS for C. melanopteruswas estimated directly from the high coverage exon-capture dataset of Maisano Delser et al. (2019).