Target enrichment and sequencing
We validated our selection of targets, bait design and our molecular protocols with four target enrichment reactions. Pools were created at equimolar concentrations and enriched with a ~19 h incubation period at 65 °C and 14-16 cycles in the post-hybridization PCR reaction. After enrichment, we purified the resulting products using 0.8X magnetic beads and rehydrated the enriched pool in 30 µl TLE. Post-enrichment libraries were quantified with Qubit and BioAnalyser, after which the results of multiple reactions were pooled at equimolar concentration and sequenced paired-end (2×150 bp) on an Illumina NextSeq 500 of the LIGAN-MP Genomics Platform of UMR8199.
Bioinformatic analysis of sequencing results
ORFs with HybPiper. Demultiplexed raw reads were cleaned with TrimGalore!, CutAdapt and Trimmomatic, performing iterative FastQC quality controls at each step. The cleaned paired-end reads of each specimen were subjected to the HYBPIPER workflow v.1.3.1 (Johnson et al., 2016). This workflow starts by mapping reads to our ORF targets and sorting them per ORF in separate directories after removal of paired duplicated reads with BLASTX and BWA v. 0.7.17 (H. Li & Durbin, 2009). Subsequently, the reads per ORF are subjected to de novo assembly using SPAdes v. 3.14.0 (Bankevich et al., 2012) resulting in the construction of a supercontig per ORF and per specimen. The resulting contigs are sorted and the boundaries between exons and flanking regions are predicted with EXONERATE v. 2.4.0 (Slater & Birney, 2005). Alignments of the supercontigs, which include exons and intronic/intergenic flanking regions, were performed with MAFFT v.7.475 (Katoh & Standley, 2013) and MUSCLE, using our ORF targets to verify exon predictions.
UCEs with Phyluce. Demultiplexed and cleaned reads were also mapped to our UCE targets and treated with PHYLUCE. We assembled contigs with Trinity, which subsequently underwent orthology detection, paralog removal and the matching of contigs to targets with lastz (Harris, 2007), after which they were aligned with MAFFT or MUSCLE. Two essential parameters in the PHYLUCE pipeline are –min_identity and –min_coverage (–min_kmer_coverage was set to two), and we examined combinations of each factor between 50 and 80% at intervals of 5% (49 combinations in total). Many of these combinations are more permissive than the recommendations of Bossert and Danforth (2017), but we used additional downstream processing to eliminate potential contaminant sequences. Parameter combinations were evaluated using the proportion of unique contigs retrieved and the total number of UCEs retrieved for all 96 samples. The resulting data were analysed in R v. 3.6.1 (R Core Team, 2019).
Enrichment balance between ORFs and UCEs. To verify the enrichment balance between ORFs and UCEs, we examined the proportion of reads that mapped on either type of target. This task has been performed starting from the reads that were identified with HYBPIPER and PHYLUCE as mapping to ORF or UCE targets, respectively, which also provides information on the specificity of our enrichment strategy. More detailed information is obtained by extracting kmers of 18 to 21 bases from various .fasta files with Jellyfish v. 2.2.10 (Marcais & Kingsford, 2011), i.e. first from .fasta files that contain the ORF and UCE target sequences, and then for a sample of specimens in our dataset. Each kmer that occurs in both the specimen and ORF dictionary is considered an ORF hit, whereas each kmer that occurs in the specimen and UCE dictionary is considered an UCE hit. The ratio of hits of both types indicates the enrichment balance between both sets of targets.
Mitogenome skimming.Unionoida and some related bivalves have double uniparentally inherited mitochondria, which appears to be unique within the animal kingdom and which may play a role in sex determination (Breton et al., 2011; Zouros et al., 1994). Against this background, and the abundant use of mitochondrial fragments in previous phylogenetic inferences for Unionidae (see above), we performed mitogenome skimming in two steps. The tissues we used for genomic library preparation predominantly or exclusively contain maternally-inherited mitochondria (Breton et al., 2011; Froufe et al., 2020). First, we aligned reads on the mitochondrial genome of Pygandon grandis (NCBI GenBank: NC_013661; Breton et al., 2009) with Bowtie2 v. 2.3.5 (Langmead & Salzberg, 2012) to produce a .bam file, which was sorted and indexed with functions of SAMTools. Subsequently, the files for all individuals were merged with SAMTools and visualized with IGV. This merged .bam file was converted into .fastq files after which the reads were assembled with SPAdes. The resulting assembly was aligned to the Pyganodon grandis mitogenome and visually inspected to retain unambiguous contigs. The retained contigs were used to recover reads for 48 Coelaturini specimens from the Malawi Basin using again Bowtie2, SAMTools and SPAdes, after which individual assemblies were reassembled with Trinity. The total assembly was evaluated and the procedure was iterated each time with laterally extended starting contigs. Annotations were performed with MITOS (Bernt et al., 2013).
Macroevolutionary analyses
Alignment of UCE and ORF data. A single transcriptome was available in NCBI GenBank for the subfamily Parreysiinae at the start of these analyses, i.e. that of Scabies phaselus (Lea, 1856) (SRX5281799; Pfeiffer et al., 2019), which was hence used as outgroup for phylogenetics. We used HYBPIPER as described above to recover information on our ORF targets for this outgroup. Phylogenetic analysis of the ORFs was performed using the supercontigs reconstructed for ingroup taxa with HYBPIPER. These supercontigs contain exons and intronic/intergenic flanking regions; alignments were produced with MAFFT. The UCEs were processed for phylogenetic analysis with PHYLUCE. Alignments for each UCE locus were obtained using phyluce_align_seqcap_align with MAFFT as aligner and without edge-trimming. Phylogenetic datasets of both UCEs and ORFs were filtered to retain target loci with >50% taxon completeness. ORF and UCE alignments were trimmed separately using BMGE v.1.2 (Cruscuolo & Gribaldo, 2010) with a maximum gap rate per sequence and character of 0.3 and a maximum entropy threshold of 0.4 to remove ambiguous regions. After concatenating trimmed single-locus alignments with AMAS v. 1.0 (Borowiec, 2016), we used Spruceup v. 2020.2.19 (Borowiec, 2019) to detect outliers, which were replaced as missing data (windows size=20, overlap=15, criterion=lognorm, cutoff=0.99). Alignment statistics were calculated with AMAS.
Phylogenetic inference. We performed phylogenetic analysis with Maximum Likelihood (ML) on concatenated datasets for ORFs and UCEs separately to evaluate the congruence between both. We used AMAS to generate a partition file for the UCEs with the sliding-window site characteristics method based on site entropies (SWSC-EN; Tagliacollo & Lanfear, 2018) to define the limits of UCE ‘core’ and flanking regions. A single partition was considered for the supercontig of each ORF. Concatenated datasets and the partitioning information were subjected to phylogenetic inference in IQTREE v. 2.0.3 (Minh et al., 2020), using the integrated ModelFinder (Kalyaanamoorthy et al., 2017) to determine the best-fit substitution model for each partition. We used 1000 bootstrap replicates; branch support values were calculated with a Shimodaira-Hasegawa approximate likelihood ratio test (SH-aLRT; Guindon et al., 2010) and ultrafast Bootstrap (UFBoot2; Hoang et al., 2018).