Sequence assembly and bioinformatics pipeline
Raw sequence reads were trimmed to 130 base pairs, the length that showed the best quality distribution in FastQC v0.11.3 using the default parameter 'phred33' (Andrews, 2010). The Stacks v2.4 pipeline (Rochette et al., 2019) was used to demultiplex the libraries, using the function process_radtags, and to cluster loci with a minimum percentage of identity of 85% within and among individuals. A de novo assembly was run on the complete dataset of each species separately. The pipeline was run with the optimal parameters described in the supplementary methods (Appendix A).
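The trimming step can be illustrated with a minimal Python sketch. It is not the code used in this study: the helper names parse_fastq and trim_reads are hypothetical, and discarding reads shorter than the target length is an assumption made here so that all retained reads have a uniform length.

```python
from typing import Iterable, Iterator, Tuple

# (header, sequence, quality string) for one FASTQ record
FastqRecord = Tuple[str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from an iterable of FASTQ lines (four lines per record)."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it, "").rstrip("\n")
        next(it, "")  # '+' separator line, not needed here
        qual = next(it, "").rstrip("\n")
        yield header, seq, qual


def trim_reads(records: Iterable[FastqRecord], length: int = 130) -> Iterator[FastqRecord]:
    """Truncate each read to `length` bases.

    Assumption: reads shorter than `length` are dropped rather than kept.
    """
    for header, seq, qual in records:
        if len(seq) >= length:
            yield header, seq[:length], qual[:length]
```

In this sketch the sequence and quality strings are truncated together so that every base keeps its own quality score.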
The populations program in Stacks (Rochette & Catchen, 2017) was used to produce two filtered genomic datasets for each species that differed in the percentage of missing data and the number of individuals per population (Table S2). The first dataset retained the loci present in at least 15% of the individuals (filter parameter: -R=0.15), with sample sizes of n = 105 individuals for R. flaccida and n = 108 individuals for C. surinamensis. The second dataset retained the loci present in at least 20% of the individuals (filter parameter: -R=0.20), with n = 80 individuals for R. flaccida and n = 71 individuals for C. surinamensis. Both datasets contained a similar number of SNPs after filtering but differed in the amount of missing data. Hereafter, these two datasets are referred to as the complete dataset, with >80% missing data, and the reduced dataset, with >75% missing data. We tested whether the amount of missing data or the reduced sample size of the two datasets would bias downstream analyses. A detailed description of the methods is given in the supplementary methods (Appendix A).
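The logic of the locus filter can be sketched in a few lines of Python. This is an illustration of the idea only, not the Stacks implementation: the function names filter_loci and missing_fraction and the dictionary layout (locus name mapped to one genotype call per individual, with None marking a missing call) are assumptions introduced for the example.

```python
from typing import Dict, List, Optional

# locus name -> one genotype call per individual; None marks a missing call
Genotypes = Dict[str, List[Optional[str]]]


def filter_loci(genotypes: Genotypes, min_fraction: float) -> Genotypes:
    """Keep loci genotyped in at least `min_fraction` of all individuals."""
    kept: Genotypes = {}
    for locus, calls in genotypes.items():
        present = sum(call is not None for call in calls)
        if calls and present / len(calls) >= min_fraction:
            kept[locus] = calls
    return kept


def missing_fraction(genotypes: Genotypes) -> float:
    """Overall proportion of missing calls across all loci and individuals."""
    total = sum(len(calls) for calls in genotypes.values())
    missing = sum(call is None for calls in genotypes.values() for call in calls)
    return missing / total if total else 0.0
```

Under these assumptions, calling filter_loci with min_fraction set to 0.15 or 0.20 mirrors the two thresholds above, and missing_fraction applied to the result gives the overall proportion of missing data in the retained matrix.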