Case studies: analysis of USCO nucleotide sequence variation
To infer nucleotide sequence variation and to perform phenetic analyses on the extracted USCO data, such as Bayesian clustering or non-metric multidimensional scaling (NMDS), SNPs were extracted from the USCO nucleotide alignments of the four case studies obtained as described in the previous section. SNP sites were extracted from the multiple nucleotide sequence alignments of diploid consensus sequences of each USCO with the software SNP-sites (Page et al., 2016), excluding low-quality sites masked by the software bcftools with lowercase letters. For this purpose, all outgroup taxa were excluded from the multiple nucleotide sequence alignments. In the dataset of Darwin’s finches, we additionally excluded the divergent ingroup genusCerthidea .
SNPs were filtered in three steps using custom Perl scripts as done by Dietz et al. (2023). First, all non-informative SNP sites (i.e., those in which all individuals except one had the same allele) were removed (removegaps_snp_inf_d.pl); second, SNP sites with missing data or with gaps in more than 50% of individuals were removed (removegaps_d.pl); and third, only SNP sites in a given gene present in the largest number of individuals in the respective gene’s nucleotide sequence alignment were kept (removegaps_snp_d.pl).
Based on the extracted SNP sites, we conducted a population structure analysis by clustering SNPs with the software STRUCTURE v. 2.3.4 (Pritchard et al., 2000) using an MCMC chain length of 50,000 (burn-in of 20,000) and a range of values for the number of ancestral populations (K) from 1 to 10. For each K value, the analysis was repeated ten times and the result with the highest likelihood was chosen. Additionally, we performed NMDS in two dimensions with the software PAST v. 4.03 (Hammer et al., 2001) using only biallelic SNPs. With a custom script (snp-pca_d.pl), genotypes homozygous for the majority allele were coded as 0, those homozygous for the rarer allele as 2 and heterozygous genotypes as 1. NMDS was then performed based on Euclidean distances. We repeated the analysis at least ten times to reduce the risk of reporting results with only locally optimal parameters and chose the results with the lowest stress value (i.e., those with the best fit to the data).
To assess whether different data extraction methods can be combined, we created multiple sequence alignments including the mzl-USCOs extracted with the BUSCO software and those extracted with Orthograph using OrthoDB v. 9 and v. 10. We combined the sequences of each gene obtained from the three approaches in a single dataset for each of the four case studies, using only those 580 genes classified as mzl-USCOs in both versions of OrthoDB. In this dataset, every specimen was consequently represented three times. We aligned the amino acid sequences from the three approaches with MAFFT v. 7.543 (Katoh & Standley, 2013) and inferred the corresponding nucleotide sequence alignments with pal2nal v. 14.1 (Suyama et al., 2006). However, we only used the nucleotide sequence alignments for phylogenetic inference, as they provide more phylogenetic signal for closely related taxa than the amino acid sequence alignments. The nucleotide sequence alignments were concatenated and the resulting supermatrix was phylogenetically analyzed with IQ-TREE as described above, except that only one analysis was conducted per dataset. Trees were rooted at the midpoint. We tested the effect of removing alignment positions with missing data and gaps on the phylogenetic results by removing alignment sites that have missing data or gaps in at least one individual and performing a phylogenetic analysis on the reduced dataset as described above. Analogously, using both data including or excluding positions with missing data or gaps, we performed phylogenetic analyses on multiple nucleotide sequence alignments of each gene and inferred a coalescent-based tree with ASTRAL as described above. Furthermore, we visually inspected the entire multiple nucleotide sequence alignment of each gene in all four case studies and removed alignment regions which contained strongly divergent sequences between extraction approaches and manually corrected obvious misalignments in AliView (Larsson, 2014). These corrected data were then used for coalescent-based and concatenation-based phylogenetic analyses as described above. Additionally, we generated versions of these corrected multiple nucleotide sequence alignments in which sites containing missing data and gaps were removed as described above.