Figure captions:
Fig. 1. Histogram showing the number of mzl-USCO gene pairs analyzed in this study which occur on the same chromosome in a given proportion of the examined taxa. The histogram shows that the proportion of genomes in which a gene pair occurs on the same chromosome is typically rather small.
Fig. 2. Distribution of median distances between neighboring mzl-USCO genes, in nucleotides divided by genome size. Left: based on real USCO data across all taxa, right: based on a random selection of protein-coding genes for each taxon. Lines connect dots belonging to the same taxon.
Fig. 3. Phylogenetic signal and systematic correlation of distances between neighboring USCOs with major metazoan lineages. A: PC axes 1 (left tree) and 2 (right tree) from a PCA on frequencies of size classes of Metazoa-level USCO distances, mapped onto the Metazoa phylogeny based on concatenated amino acid sequences. B: Plot of axes 1 and 2 from the same PCA, showing a clustering of major metazoan lineages (Protostomia and Deuterostomia with unfilled and filled color symbols, respectively).
Fig. 4. Data yield and results of analyses on mzl-USCOs extracted from Drosophila WGS reads when applying three different USCO extraction methods. A: Number of mzl-USCOs recovered per number of specimens; B: ASTRAL trees based on generated USCO datasets; C: Outcome of SNP clustering analyses with STRUCTURE; D: NMDS plots of SNP similarity.
Fig. 5. Species delimitation of the four case studies based on the programs tr2 and SODA on each data set from the three different extraction methods. Colored boxes indicate that inferred species entities match with currently recognized morphospecies.
Figure S1. Proportion of pairwise sequence overlap in the concatenated alignment of USCO loci between pairs of chromosome-level annotated metazoan genomes.
Figure S2. Maximum likelihood phylogenetic tree based on concatenated amino acid USCO sequences of all analyzed chromosome-level annotated genomes of Metazoa. Numbers above branches are support values from approximate likelihood ratio tests and ultrafast bootstrapping.
Figure S3. Maximum likelihood phylogenetic tree based on concatenated nucleotide USCO sequences (codon positions 1 and 2) of all analyzed chromosome-level annotated genomes of Metazoa. Numbers above branches are support values from approximate likelihood ratio tests and ultrafast bootstrapping.
Figure S4. Multispecies coalescent-based phylogenetic tree based on gene trees of amino acid USCO sequences of all analyzed chromosome-level annotated genomes of Metazoa. Numbers above branches are local posterior probabilities.
Figure S5. Multispecies coalescent-based phylogenetic tree based on gene trees of nucleotide USCO sequences (codon positions 1 and 2) of all analyzed chromosome-level annotated genomes of Metazoa. Numbers above branches are local posterior probabilities.
Figure S6. Quotient of median distance between neighboring mzl-USCOs to the median distance between neighboring randomly selected annotated protein-coding genes, mapped onto the Metazoa phylogeny based on concatenated amino acid sequences.
Figure S7. Axes 1 and 2 of a PCA on frequencies of size classes of distances between neighboring Metazoa-level USCOs mapped onto the Metazoa phylogeny based on concatenated amino acid sequences (detailed version with taxon names of analyzed chromosome-level genomes).
Figure S8. Number of mzl-USCOs recovered per number of specimens when applying different USCO extraction methods.
Figure S9. Proportion of pairwise sequence overlap in the concatenated alignment of USCO loci between pairs of specimens within each case study (Anopheles, Drosophila, Heliconius, Darwin’s finches) analyzed in the present investigation, sorted by extraction method (BUSCO, Orthograph + OrthoDB v. 9, Orthograph + OrthoDB v. 10).
Figure S10. Phylogenetic trees of Anopheles species inferred with concatenated USCO nucleotide sequences (above) and with the multispecies coalescent (below) generated with different USCO extraction methods.
Figure S11. Phylogenetic trees of Drosophila species inferred with concatenated USCO nucleotide sequences (above) and with the multispecies coalescent (below) generated with different USCO extraction methods.
Figure S12. Phylogenetic trees of Heliconius species inferred with concatenated USCO nucleotide sequences (above) and with the multispecies coalescent (below) generated with different USCO extraction methods.
Figure S13. Phylogenetic trees of Darwin’s finches inferred with concatenated USCO nucleotide sequences (above) and with the multispecies coalescent (below) generated with different USCO extraction methods.
Figure S14. NMDS plots showing similarities between specimens inferred with SNP data of mzl-USCOs for the four study groups based on datasets generated with different data extraction methods.
Figure S15. Diagrams of STRUCTURE clustering results inferred with SNP data of mzl-USCOs for the four study groups based on datasets generated with different data extraction methods.
Figure S16. ML trees of concatenated multiple nucleotide sequence alignments of 580 genes classified as mzl-USCOs in both OrthoDB v.9 and v.10 and extracted with three methods from Anopheles genomic data. Trees, from left to right, are based on: 1) all data, 2) data after excluding alignment positions with missing data and gaps (gaps excluded), 3) a manually corrected alignment (corrected), and 4) a manually corrected alignment with additional exclusion of alignment positions with missing data and gaps (corrected + gaps excluded).
Figure S17. Coalescent-based trees inferred in the Anopheles case study with data from the three USCO extraction approaches aligned in a single dataset using only those 580 genes classified as mzl-USCOs in both OrthoDB v.9 and v.10. Trees, from left to right, are based on: 1) all data, 2) data after excluding alignment positions with missing data and gaps (gaps excluded), 3) a manually corrected alignment (corrected), and 4) a manually corrected alignment with additional exclusion of alignment positions with missing data and gaps (corrected + gaps excluded).
Figure S18. ML trees of concatenated multiple nucleotide sequence alignments of 580 genes classified as mzl-USCOs in both OrthoDB v.9 and v.10 and extracted with three methods from Drosophila genomic data. Trees, from left to right, are based on: 1) all data, 2) data after excluding alignment positions with missing data and gaps (gaps excluded), 3) a manually corrected alignment (corrected), and 4) a manually corrected alignment with additional exclusion of alignment positions with missing data and gaps (corrected + gaps excluded).
Figure S19. Coalescent-based trees inferred in the Drosophila case study with data from the three USCO extraction approaches aligned in a single dataset using only those 580 genes classified as mzl-USCOs in both OrthoDB v.9 and v.10. Trees, from left to right, are based on: 1) all data, 2) data after excluding alignment positions with missing data and gaps (gaps excluded), 3) a manually corrected alignment (corrected), and 4) a manually corrected alignment with additional exclusion of alignment positions with missing data and gaps (corrected + gaps excluded).
Figure S20. ML trees of concatenated multiple nucleotide sequence alignments of 580 genes classified as mzl-USCOs in both OrthoDB v.9 and v.10 and extracted with three methods from Heliconius genomic data. Trees, from left to right, are based on: 1) all data, 2) data after excluding alignment positions with missing data and gaps (gaps excluded), 3) a manually corrected alignment (corrected), and 4) a manually corrected alignment with additional exclusion of alignment positions with missing data and gaps (corrected + gaps excluded).
Figure S21. Coalescent-based trees inferred in the Heliconius case study with data from the three USCO extraction approaches aligned in a single dataset using only those 580 genes classified as mzl-USCOs in both OrthoDB v.9 and v.10. Trees, from left to right, are based on: 1) all data, 2) data after excluding alignment positions with missing data and gaps (gaps excluded), 3) a manually corrected alignment (corrected), and 4) a manually corrected alignment with additional exclusion of alignment positions with missing data and gaps (corrected + gaps excluded).
Figure S22. ML trees of concatenated multiple nucleotide sequence alignments of 580 genes classified as mzl-USCOs in both OrthoDB v.9 and v.10 and extracted with three methods from genomic data of Darwin’s finches. Trees, from left to right, are based on: 1) all data, 2) data after excluding alignment positions with missing data and gaps (gaps excluded), 3) a manually corrected alignment (corrected), and 4) a manually corrected alignment with additional exclusion of alignment positions with missing data and gaps (corrected + gaps excluded).
Figure S23. Coalescent-based trees inferred in the Darwin’s finches case study with data from the three USCO extraction approaches aligned in a single dataset using only those 580 genes classified as mzl-USCOs in both OrthoDB v.9 and v.10. Trees, from left to right, are based on: 1) all data, 2) data after excluding alignment positions with missing data and gaps (gaps excluded), 3) a manually corrected alignment (corrected), and 4) a manually corrected alignment with additional exclusion of alignment positions with missing data and gaps (corrected + gaps excluded).
Figure S24. Results of species delimitation using tr2 and SODA in each case study and applying each of the three data extraction approaches.
Table S1. Metazoan genomes assembled to chromosome level included in this study, with numbers of single-copy mzl-USCO genes found in these genomes with the BUSCO software, number of chromosomes, genome size, median distance between neighboring USCOs, median distance between neighboring randomly chosen annotated protein-coding genes, logarithms of those two distances, the distances divided by genome size, the quotient between these distances, p-value based on 10,000 replicates for the mzl-USCO distance being smaller, adjusted evenness values for chromosome length, number of coding genes, number of mzl-USCOs, chi-square values for distribution of mzl-USCOs compared to chromosome length and to number of coding genes, and p-values derived from the chi-square tests.
Table S2. NCBI accession numbers of the raw reads from individuals analyzed in the four taxonomic case studies.