Selection of ORF targets
To identify target ORFs for orthology assessment, we first clustered the contigs of our 12 Agalma transcriptomes into a supertranscriptome using CD-Hit-Est v. 4.8.1 (W. Li & Godzik, 2006) with a 95% similarity threshold, retaining the longest variant per cluster. We predicted ORFs for the supertranscriptome with TransDecoder v. 5.5.0 (Haas & Papanicolaou, 2018) in two steps. First, we identified all likely coding regions, from which we retained long ORFs, after which we selected the best supported ORF per transcript. These ORF predictions were clustered again with CD-Hit-Est at 95% similarity to create homologous clusters. These homologous ORF clusters were expected to still contain a high level of redundancy because of fragmentation, frameshifts, mis-indexing, mis-assembly or the potential existence of isoforms or paralogs. We evaluated orthology in two steps. First, we mapped our ORF clusters against the BUSCO Metazoa_odb9 database (Waterhouse et al., 2017) and against the Unioverse probe set (Pfeiffer et al., 2019) to identify targets that have high chances of being single-copy and orthologous (=candidate genes). This approach does not allow to detect unexpressed homologs or weakly divergent paralogs, so additional follow-up verifications are performed during probe design (see below). We extracted all complete BUSCO hits, which consist of single-copy hits and multi-copy hits. For ORFs with single-copy BUSCO hits the likelihood of multiple gene copies is small, given the gene is included in the single-copy orthologous database of BUSCO, and given the lack of expression variants with >95% similarity after clustering 12 ingroup transcriptomes. Multi-copy BUSCO hits imply that at least two ORF homologous clusters map to a BUSCO gene and these were manually curated to distinguish among the following scenarios: 1) multiple copies may indicate the existence of distant homologs or paralogs, 2) assembly errors may have resulted in the creation of multiple contigs for the same gene, 3) various isoforms may exist of a single expressed RNA fragment. All ORFs that map to an individual BUSCO were merged into a .fasta file after which contigs were aligned with MUSCLE v.3.8.1551 (Edgar, 2004) and the nucleotide and protein sequences of homologs were visually inspected in SeaView v. 5 (Galtier et al., 1996; Gouy et al., 2010). If the presence of paralogs was suspected in this evaluation, we rejected all ORFs for the respective BUSCO hit. If different splicing or minor difficulties in assembly were suspected, the longest ORF was retained.
A similar approach was used to evaluate the 811 loci from the Unioverse probe set, which was developed for anchored hybrid enrichment using the distant Bathymodiolus platifrons genome (Pfeiffer et al., 2019). This probe set contains 173,707 nucleotides, on average 214 per locus. To avoid overlap with targets that were already retained from the abovementioned BUSCO comparisons, we first screened the Unioverse loci against the BUSCO Metazoa_odb9 database and against our already retained ORFs with Yass v. 1.15 (NoƩ & Kucherov, 2005). In total 109 Unioverse loci were accounted for by these verifications, leaving 702 loci to be examined. Our remaining ORF clusters produced hits on all 702 remaining Unioverse loci, and all hits for the same Unioverse locus were compiled and aligned. When several Unioverse loci mapped onto the same ORF, we also performed alignments including all concerned Unioverse loci and all associated ORFs in SeaView. The subsequent evaluation of candidate genes followed the criteria indicated above for our BUSCO evaluation. Finally, we mapped all retained ORF targets and any subregions within these ORFs to one another with Yass to examine the percentage of sequence similarity over a certain alignment length. ORFs were removed if alignment lengths and similarities were judged to potentially interfere with target enrichment.