Evaluation of target regions
Of the 84,484 ART_Illumina reads, 66,804 (79.1%) mapped onto the Venustaconcha genome, all of which were unique hits. These hits covered 3,931 of our 5,221 target regions (75.3%): 631 of 633 (99.7%) complete single-copy BUSCO ORFs, all of our complete duplicate BUSCO ORFs, 295 of 296 (99.7%) Unioverse ORFs, 996 of the 1,255 stringent UCEs (79.4%; i.e. UCEs found in at least 6 genomes) and 1,824 of our 2,852 less stringent UCEs (64.0%; UCEs found in 5 of our 7 genomes), indicating a higher efficiency for ORFs than for UCEs. This mapping indicated that ORFs were regularly (but not always) retrieved completely on the same Venustaconcha contig, and that we could expect to retrieve multiple exons per ORF. In the Venustaconcha genome these exons were typically larger than 200 nt and often separated from other exons of the same ORF by 1,000s or 10,000s of nt.
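The recovery percentages reported above follow directly from the counts given in the text. A minimal Python sketch of that arithmetic is shown here; the variable and function names are illustrative only, and the number of duplicated BUSCO ORFs (185) is not stated in the text but is inferred as the remainder needed to reach the 5,221 targets.

```python
# Reported totals for the in-silico evaluation (from the text).
TOTAL_TARGETS = 5221
TOTAL_COVERED = 3931

# (covered, total, kind) per target category, as reported in the text.
targets = {
    "single-copy BUSCO ORFs": (631, 633, "ORF"),
    "Unioverse ORFs": (295, 296, "ORF"),
    "stringent UCEs": (996, 1255, "UCE"),
    "less stringent UCEs": (1824, 2852, "UCE"),
}

# Assumption: the duplicated BUSCO ORFs make up the remaining targets.
# The text states that all of them were covered.
n_duplicated = TOTAL_TARGETS - sum(total for _, total, _ in targets.values())
targets["duplicated BUSCO ORFs (inferred)"] = (n_duplicated, n_duplicated, "ORF")


def percent(covered, total):
    """Percentage of covered regions, rounded to one decimal place."""
    return round(100.0 * covered / total, 1)


# Per-category recovery.
recovery = {name: percent(c, t) for name, (c, t, _) in targets.items()}


def pooled(kind):
    """Recovery percentage pooled over all categories of one kind."""
    covered = sum(c for c, _, k in targets.values() if k == kind)
    total = sum(t for _, t, k in targets.values() if k == kind)
    return percent(covered, total)


orf_recovery = pooled("ORF")
uce_recovery = pooled("UCE")
covered_sum = sum(c for c, _, _ in targets.values())

for name, value in recovery.items():
    print(f"{name}: {value}%")
print(f"ORFs pooled: {orf_recovery}%, UCEs pooled: {uce_recovery}%")
```

Under the stated assumption the per-category counts sum to the reported 3,931 covered regions, and pooling by target type makes the contrast between ORFs and UCEs explicit.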
Our targets for bait design covered 5,221 genomic regions with a total length of 2,272,996 nt. Of the 40,269 raw probes, 37,959 passed quality control (~94.3%). The impact of this filtering on the overall coverage of our target regions was minimal; however, for three UCEs we were not able to develop any probes, and for four ORFs the discarded probes resulted in gaps of >300 nt, so that these ORFs were expected to be incompletely covered upon target enrichment.
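As an illustration of how gaps left by discarded probes could be identified, the sketch below scans the probes retained for one target and reports uncovered stretches longer than 300 nt. It is not the procedure used for the bait design: the function name, the half-open coordinate convention and the example probe positions are all hypothetical.

```python
def uncovered_gaps(target_length, probes, min_gap=300):
    """Return uncovered (start, end) stretches longer than min_gap.

    probes: iterable of (start, end) half-open intervals on the target.
    Coordinates and threshold handling are assumptions of this sketch.
    """
    gaps = []
    cursor = 0  # first position not yet covered by any probe
    for start, end in sorted(probes):
        if start - cursor > min_gap:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    # Check the stretch between the last probe and the end of the target.
    if target_length - cursor > min_gap:
        gaps.append((cursor, target_length))
    return gaps


# Probe pass rate from the counts reported in the text.
pass_rate = round(100.0 * 37959 / 40269, 1)

# Hypothetical ORF of 1,200 nt with the probes that survived filtering.
example_probes = [(0, 120), (60, 180), (600, 720), (700, 820), (1100, 1200)]
example_gaps = uncovered_gaps(1200, example_probes)

print(pass_rate)
print(example_gaps)
```

In this toy case the 420-nt stretch between positions 180 and 600 is flagged, whereas the 280-nt stretch between 820 and 1,100 falls below the threshold.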
ORF and UCE recovery
On average, we obtained over 3 million reads per sample (range: 35,246 to 9,132,732), of which (mean±sd) 61.81±13.94% were on target. Broken down by target type, 60.83±14.15% of reads related to ORFs and 0.97±0.73% to UCEs. There was a weak but significant positive correlation between the total number of reads per sample and the proportion of on-target reads (R² = 0.045, F = 4.442, df = 1+94, p = 0.038), but no trend in the total number of reads per phylogenetic clade within Coelaturini (R² = 0.020, F = 1.961, df = 1+94, p = 0.165; Fig. S1). We observed unbalanced enrichment of UCEs versus ORFs: whereas the UCE regions contain 26.17% of the total of targeted nucleotides, the number of UCE kmer hits compared to ORF kmer hits is around 0.05%, indicating a substantial underrepresentation of UCEs compared to ORFs.
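For a linear model with a single predictor, the F statistic and R² are linked by F = (R² / df_model) / ((1 − R²) / df_resid), so the two reported tests can be checked for internal consistency. The sketch below assumes that "df = 1+94" denotes 1 model and 94 residual degrees of freedom; the function names are illustrative.

```python
def f_from_r2(r2, df_model, df_resid):
    """F statistic implied by R^2 for a linear model."""
    return (r2 / df_model) / ((1.0 - r2) / df_resid)


def r2_from_f(f, df_model, df_resid):
    """R^2 implied by an F statistic (inverse of f_from_r2)."""
    return (f * df_model) / (f * df_model + df_resid)


# Reads per sample versus proportion of on-target reads.
r2_reads = round(r2_from_f(4.442, 1, 94), 3)
# Reads per sample versus phylogenetic clade.
r2_clade = round(r2_from_f(1.961, 1, 94), 3)

print(r2_reads, r2_clade)
print(f_from_r2(0.045, 1, 94), f_from_r2(0.020, 1, 94))
```

Going from the reported F values back to R² reproduces the reported 0.045 and 0.020 after rounding. The forward direction gives F values slightly below those reported, as expected when starting from R² values rounded to three decimals.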
On average, over 1,102 of the 1,114 ORFs were consistently enriched and mapped for all 95 unionids (857 were consistently recovered for over 50% of their length in all specimens), with the exception of the distant iridinid specimen (dna0240; see Fig. 2). HYBPIPER detected hidden paralogy in at most 2 ORFs per specimen. As the number of reads obtained for a sample decreased, we saw a gradual decrease in the recovery of ORFs, which became more marked for samples with <500,000 reads (n = 12). As to UCEs, we recovered data for up to 1,905 of the 4,104 UCEs (46.5%), and the coverage per sample was proportional to the number of reads (linear model: r² = 0.557, p < 0.001), as was observed for the ORFs (Fig. 2). The combination of 55% and 60% thresholds on sequence coverage and identity, respectively, maximized the total yield over all specimens, but it decreased the number of retrieved UCEs slightly to 1,895. On average, 281 UCEs were covered per individual (range 30-473; total of 26,982 regions recovered for 96 samples). The number of recovered UCEs and the proportion of unique contigs recovered per individual decreased gradually as the thresholds on sequence coverage and identity were altered, with more abrupt decreases when the threshold on %identity was increased to ≥70% (Fig. 3). The consistency with which UCEs were recovered across taxa was low: only 37 and 276 UCEs were recovered in >75% and >50% of individuals, respectively. The length distribution of the retained UCEs is highly similar to that of all UCEs (Fig. S2).