Case studies: analysis of USCO nucleotide sequence variation
To infer nucleotide sequence variation and to perform phenetic analyses
on the extracted USCO data, such as Bayesian clustering or non-metric
multidimensional scaling (NMDS), SNPs were extracted from the USCO
nucleotide alignments of the four case studies obtained as described in
the previous section. SNP sites were extracted from the multiple
nucleotide sequence alignments of diploid consensus sequences of each
USCO with the software SNP-sites (Page et al., 2016), excluding
low-quality sites masked by the software bcftools with lowercase
letters. For this purpose, all outgroup taxa were excluded from the
multiple nucleotide sequence alignments. In the dataset of Darwin’s
finches, we additionally excluded the divergent ingroup genusCerthidea .
SNPs were filtered in three steps using custom Perl scripts as done by
Dietz et al. (2023). First, all non-informative SNP sites (i.e., those
in which all individuals except one had the same allele) were removed
(removegaps_snp_inf_d.pl); second, SNP sites with missing data or
with gaps in more than 50% of individuals were removed
(removegaps_d.pl); and third, only SNP sites in a given gene present in
the largest number of individuals in the respective gene’s nucleotide
sequence alignment were kept (removegaps_snp_d.pl).
Based on the extracted SNP sites, we conducted a population structure
analysis by clustering SNPs with the software STRUCTURE v. 2.3.4
(Pritchard et al., 2000) using an MCMC chain length of 50,000 (burn-in
of 20,000) and a range of values for the number of ancestral populations
(K) from 1 to 10. For each K value, the analysis was repeated ten times
and the result with the highest likelihood was chosen. Additionally, we
performed NMDS in two dimensions with the software PAST v. 4.03 (Hammer
et al., 2001) using only biallelic SNPs. With a custom script
(snp-pca_d.pl), genotypes homozygous for the majority allele were coded
as 0, those homozygous for the rarer allele as 2 and heterozygous
genotypes as 1. NMDS was then performed based on Euclidean distances. We
repeated the analysis at least ten times to reduce the risk of reporting
results with only locally optimal parameters and chose the results with
the lowest stress value (i.e., those with the best fit to the data).
To assess whether different data extraction methods can be combined, we
created multiple sequence alignments including the mzl-USCOs extracted
with the BUSCO software and those extracted with Orthograph using
OrthoDB v. 9 and v. 10. We combined the sequences of each gene obtained
from the three approaches in a single dataset for each of the four case
studies, using only those 580 genes classified as mzl-USCOs in both
versions of OrthoDB. In this dataset, every specimen was consequently
represented three times. We aligned the amino acid sequences from the
three approaches with MAFFT v. 7.543 (Katoh & Standley, 2013) and
inferred the corresponding nucleotide sequence alignments with pal2nal
v. 14.1 (Suyama et al., 2006). However, we only used the nucleotide
sequence alignments for phylogenetic inference, as they provide more
phylogenetic signal for closely related taxa than the amino acid
sequence alignments. The nucleotide sequence alignments were
concatenated and the resulting supermatrix was phylogenetically analyzed
with IQ-TREE as described above, except that only one analysis was
conducted per dataset. Trees were rooted at the midpoint. We tested the
effect of removing alignment positions with missing data and gaps on the
phylogenetic results by removing alignment sites that have missing data
or gaps in at least one individual and performing a phylogenetic
analysis on the reduced dataset as described above. Analogously, using
both data including or excluding positions with missing data or gaps, we
performed phylogenetic analyses on multiple nucleotide sequence
alignments of each gene and inferred a coalescent-based tree with ASTRAL
as described above. Furthermore, we visually inspected the entire
multiple nucleotide sequence alignment of each gene in all four case
studies and removed alignment regions which contained strongly divergent
sequences between extraction approaches and manually corrected obvious
misalignments in AliView (Larsson, 2014). These corrected data were then
used for coalescent-based and concatenation-based phylogenetic analyses
as described above. Additionally, we generated versions of these
corrected multiple nucleotide sequence alignments in which sites
containing missing data and gaps were removed as described above.