Sequence assembly and bioinformatics pipelines
Raw sequence reads were trimmed to 130 base pairs, the length with the best
quality distribution observed in FastQC v0.11.3 using the default
parameter ’phred33’ (Andrews, 2010). The Stacks v2.4 pipeline was used
to demultiplex the libraries and cluster loci using the function
process_radtags with a minimum percentage identity of 85% within
and among individuals (Rochette et al., 2019). A de novo assembly
was run for the complete dataset for each species separately. The
pipeline was run with the optimal parameters as described in the
supplementary methods (Appendix A).
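
As an illustration of the trimming step only, the following minimal Python sketch truncates FASTQ records to a fixed length. It is not the command used in this study: the function names (parse_fastq, truncate_reads) are hypothetical, and discarding reads shorter than the target length is an assumption made here so that all retained reads have the same length.

```python
from typing import Iterable, Iterator, Tuple

# A FASTQ record: header, sequence, separator ("+"), quality string.
FastqRecord = Tuple[str, str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield four-line FASTQ records from an iterable of text lines."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(stripped) - 3, 4):
        yield stripped[i], stripped[i + 1], stripped[i + 2], stripped[i + 3]


def truncate_reads(
    records: Iterable[FastqRecord], length: int = 130
) -> Iterator[FastqRecord]:
    """Truncate each read (and its quality string) to `length` bases.

    Assumption: reads shorter than `length` are dropped, so every
    retained read has exactly the same length.
    """
    for header, seq, sep, qual in records:
        if len(seq) < length:
            continue
        yield header, seq[:length], sep, qual[:length]


# Toy input with two very short reads; a target length of 5 is used
# here only to keep the example readable.
toy_fastq = ["@r1", "ACGTACGT", "+", "IIIIIIII", "@r2", "ACG", "+", "III"]
trimmed = list(truncate_reads(parse_fastq(toy_fastq), length=5))
```

In the toy input, the first read is cut to five bases and the second is discarded because it is shorter than the target length.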
The populations program in Stacks (Rochette &
Catchen, 2017) was used to produce two filtered genomic datasets for
each species that differed in the percentage of missing data and
the number of individuals per population (Table S2). The first dataset
retained the loci present in at least 15% of the individuals (filter
parameter: -R=0.15), with sample sizes of n = 105 individuals for R. flaccida and n = 108 individuals for C.
surinamensis. A second dataset was generated by retaining the loci that
were present in at least 20% of the individuals (filter parameter:
-R=0.20), considering n = 80 individuals for R. flaccida
and n = 71 individuals for C. surinamensis. Both datasets
present a similar number of SNPs after filtering, yet a different amount
of missing data. Hereafter, these two datasets are referred to as
the complete dataset, which has >80% missing
data, and the reduced dataset, which has >75% missing data.
Tests were performed to check whether the amount of missing data or the reduced sample
size of the two datasets would bias downstream analyses. A detailed
description of the methods used is given in the supplementary methods
(Appendix A).
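
To make the logic of the locus-presence filter concrete, the following minimal Python sketch keeps loci genotyped in at least a given fraction of individuals and then reports the proportion of missing genotypes in the filtered matrix. It is a conceptual illustration, not the Stacks implementation: the function names, the dictionary-based genotype matrix, and the toy data are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical genotype matrix: locus ID -> one genotype call per
# individual, with None marking a missing genotype.
Genotypes = Dict[str, List[Optional[str]]]


def filter_loci_by_presence(genotypes: Genotypes, min_fraction: float) -> Genotypes:
    """Keep loci genotyped in at least `min_fraction` of all individuals."""
    kept: Genotypes = {}
    for locus, calls in genotypes.items():
        present = sum(call is not None for call in calls)
        if calls and present / len(calls) >= min_fraction:
            kept[locus] = calls
    return kept


def missing_data_fraction(genotypes: Genotypes) -> float:
    """Return the proportion of missing genotype calls in the matrix."""
    total = sum(len(calls) for calls in genotypes.values())
    if total == 0:
        return 0.0
    missing = sum(call is None for calls in genotypes.values() for call in calls)
    return missing / total


# Toy data for four individuals (illustrative values only).
toy: Genotypes = {
    "locus_a": ["A/A", None, None, None],    # present in 25% of individuals
    "locus_b": [None, None, None, None],     # present in none
    "locus_c": ["A/T", "A/A", "T/T", None],  # present in 75% of individuals
}

lenient = filter_loci_by_presence(toy, min_fraction=0.25)
strict = filter_loci_by_presence(toy, min_fraction=0.50)
lenient_missing = missing_data_fraction(lenient)
strict_missing = missing_data_fraction(strict)
```

The sketch also shows why a permissive threshold leaves a high proportion of missing data: a locus that just passes a 15% presence threshold is, by construction, missing in up to 85% of the individuals.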