Spatial distribution and potential linkage patterns of mzl-USCOs
in genomes
We extracted mzl-USCOs from chromosome-level assembled genomes of 239
species of Metazoa, covering almost all major lineages of Protostomia
and Deuterostomia. As expected, we found that the large majority of the
mzl-USCOs were consistently present in most investigated species, and
pairwise aligned nucleotide or amino acid sequences of mzl-USCOs from
different species were found to overlap in the multiple sequence
alignment of each gene to a high degree (Figure S1). The median distance
between neighboring mzl-USCOs on a chromosome was on average 742,876 bp
(+/- 607,054 bp SD). Considering all possible pairs of mzl-USCOs, we
found that in the vast majority of genomes the two mzl-USCOs in a pair
were located on different chromosomes (Fig. 1). Specifically, we found
only 1.3% of the analyzed pairs of mzl-USCOs to be located on the same
chromosome in more than 50% of the analyzed species. Only 0.2% of all
analyzed pairs of mzl-USCOs were found on the same chromosome in more
than 75% of the analyzed species. Looking at these latter pairs in more
detail, we found the two mzl-USCOs in each pair to be separated on a
given chromosome by a mean distance across all taxa of 11.6 Mbp (+/-
6.1 Mbp SD), with the spatial separation differing widely between taxa
(average standard deviation 18.2 Mbp, +/- 8.8 Mbp SD).
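The two summary quantities used above can be illustrated with a minimal sketch. It assumes that each species is represented by a mapping from gene identifier to chromosome and start coordinate, that distances are measured between start coordinates, and that a pair is only counted in species where both genes were found. The function names and the input layout are hypothetical and are not taken from the analysis pipeline.

```python
from collections import defaultdict
from itertools import combinations


def neighbor_distances(positions):
    """Distances (bp) between adjacent genes on each chromosome.

    positions: dict mapping gene id -> (chromosome, start_bp).
    Assumption: distance is measured between start coordinates.
    """
    by_chrom = defaultdict(list)
    for chrom, start in positions.values():
        by_chrom[chrom].append(start)
    distances = []
    for starts in by_chrom.values():
        starts.sort()
        distances.extend(b - a for a, b in zip(starts, starts[1:]))
    return distances


def same_chromosome_fraction(species_positions):
    """For each gene pair, the fraction of species in which both genes
    lie on the same chromosome.

    species_positions: dict mapping species -> positions dict (see above).
    Assumption: only species containing both genes enter the denominator.
    """
    shared = defaultdict(int)
    seen = defaultdict(int)
    for positions in species_positions.values():
        for g1, g2 in combinations(sorted(positions), 2):
            seen[(g1, g2)] += 1
            if positions[g1][0] == positions[g2][0]:
                shared[(g1, g2)] += 1
    return {pair: shared[pair] / n for pair, n in seen.items()}
```

Pairs whose fraction exceeds 0.5 or 0.75 would then correspond to the two thresholds reported above.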
While these data imply that mzl-USCOs can be regarded as genetically
largely unlinked in practical applications, mzl-USCOs show a slight
tendency to cluster compared to randomly chosen protein-coding genes.
Specifically, we found physical distances between neighboring mzl-USCOs
normalized by genome size to be consistently slightly lower than
expected by chance when compared with distances from the same number of
randomly chosen protein-coding genes. In all but three taxa, the median
distance, both absolute and normalized by genome size, was lower in the
USCO data than the median in the randomly chosen protein-coding genes
(inferred from 10,000 simulations in each taxon; Fig. 2). In 195 taxa
(82% of all investigated taxa), the difference was statistically
significant (p < 0.05). On average, the median absolute
distance was lower by 106,062 +/- 91,451 bp in the real data, the
normalized distance by 9.91 × 10^-5 +/- 6.57 × 10^-5 of genome size
(15.77 +/- 9.4%). The
extent to which mzl-USCOs cluster more than randomly chosen genes tends
to be larger in arthropods than in vertebrates (Table S1).
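The comparison with randomly chosen protein-coding genes can be sketched as a simple resampling test. The version below is an illustration under stated assumptions, not the procedure actually used: gaps are pooled across chromosomes before taking the median, random genes are drawn without replacement from all annotated protein-coding genes, and the p-value is the one-sided share of simulated medians that are at most as large as the observed one. All names are hypothetical.

```python
import random
from collections import defaultdict
from statistics import median


def median_neighbor_distance(genes):
    """Median gap (bp) between adjacent genes, pooled over chromosomes.

    genes: iterable of (chromosome, start_bp) tuples.
    Returns None if no chromosome carries at least two genes.
    """
    by_chrom = defaultdict(list)
    for chrom, start in genes:
        by_chrom[chrom].append(start)
    gaps = []
    for starts in by_chrom.values():
        starts.sort()
        gaps.extend(b - a for a, b in zip(starts, starts[1:]))
    return median(gaps) if gaps else None


def clustering_test(usco_genes, all_genes, n_sim=10_000, seed=0):
    """Compare the observed median gap with medians from random gene sets.

    usco_genes, all_genes: lists of (chromosome, start_bp) tuples.
    Returns (observed median, median of simulated medians, one-sided p).
    """
    observed = median_neighbor_distance(usco_genes)
    if observed is None:
        raise ValueError("need at least two genes on one chromosome")
    rng = random.Random(seed)
    simulated = []
    for _ in range(n_sim):
        m = median_neighbor_distance(rng.sample(all_genes, len(usco_genes)))
        if m is not None:
            simulated.append(m)
    # One-sided empirical p-value with the usual +1 correction.
    n_as_low = sum(m <= observed for m in simulated)
    p_value = (n_as_low + 1) / (len(simulated) + 1)
    return observed, median(simulated), p_value
```

Dividing each gap by the genome size before taking the median gives the normalized variant of the same test.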
We found the distribution of absolute distances (in nucleotides) between
neighboring mzl-USCOs on chromosomes to be highly correlated with the
taxon’s genome size (correlation of median distance with genome size: r
= 0.9714, p < 0.001). When binning absolute distances into
eleven categories and using a PCA to visualize the degree of similarity
between taxa in their distance values (plot not shown), separation of
taxa along the first axis (which explained 71% of the total variance)
strongly correlated with the logarithm of the taxon’s genome size (r =
-0.9818, p < 0.001). We focused in the present investigation
on the conspicuous patterns found in normalized distances (nucleotides
divided by genome size), as this metric was less confounded by the
organism’s genome size: the correlation of median normalized distance
with genome size was weak (r = -0.17201, p = 0.008). When binning
normalized distances between neighboring mzl-USCOs on chromosomes into
eleven categories and using a PCA to visualize the degree of similarity
(Fig. 3b), we found the clustering of taxa in some instances to
correspond noticeably with higher-level systematic units, such as
Insecta (red triangles), teleost fishes (gray dots), birds (black
squares), and mammals (black triangles; Fig. 3).
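As an illustration of the binning step, the sketch below converts the neighbor distances of one taxon into an eleven-category profile of normalized distances. The bin edges are purely hypothetical, since only the number of categories is stated here, and the function name is likewise an assumption.

```python
from bisect import bisect_right

# Hypothetical bin edges, expressed as fractions of genome size.
# Ten edges define eleven categories; the actual edges are not given.
BIN_EDGES = [1e-6, 2e-6, 5e-6, 1e-5, 2e-5, 5e-5, 1e-4, 2e-4, 5e-4, 1e-3]


def binned_profile(distances_bp, genome_size_bp, edges=BIN_EDGES):
    """Share of neighbor distances falling into each category.

    distances_bp: non-empty list of distances between neighboring genes.
    Each distance is divided by genome size before binning, so that
    profiles of taxa with very different genome sizes are comparable.
    """
    counts = [0] * (len(edges) + 1)
    for d in distances_bp:
        counts[bisect_right(edges, d / genome_size_bp)] += 1
    total = sum(counts)
    return [c / total for c in counts]
```

Stacking the profiles of all taxa row by row yields a taxa-by-category matrix that could then be passed to a PCA.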
The adjusted evenness of the distribution of mzl-USCOs between
chromosomes ranged between 0.58 and 0.99 (mean 0.87 +/- 0.09). It tends
to be especially low in birds and especially high in teleost fish (Table
S1). It is highly correlated with both the evenness of chromosome length
(r = 0.83, p = 6.26 × 10^-61) and especially that of
the distribution of all protein-coding genes (r = 0.94, p = 1.98 ×
10^-110).
In many taxa, our chi-square test showed significant deviations of USCO
distribution from the distribution of chromosome lengths (Table S1). In
215 taxa (90% of all investigated taxa), the chi-square test showed a
statistically significant (p < 0.05) deviation without
correction for multiple tests, and in 153 of the taxa (64%), the test
result remained significant after Bonferroni correction for multiple
tests. The deviation tended to be particularly high in birds and
particularly low in teleost fish. The chi-square test showed that the
deviation from the distribution of all protein-coding genes was
significant in 170 taxa (71%), but in only 43 of these taxa (18%) it
remained so after Bonferroni correction. A correlation with phylogenetic
placement of the taxa was less obvious than in the comparison with
chromosome length.
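The logic of the goodness-of-fit comparison and of the Bonferroni correction can be sketched as follows. This is a minimal illustration that assumes expected counts proportional to chromosome length (for the second comparison, the number of protein-coding genes per chromosome would take the place of the lengths). The function names are hypothetical. Converting the statistic into a p-value requires the chi-square distribution with (number of chromosomes - 1) degrees of freedom, which is left to a statistics library.

```python
def chi_square_statistic(observed_counts, chromosome_weights):
    """Chi-square goodness-of-fit statistic for genes per chromosome.

    observed_counts: number of mzl-USCOs found on each chromosome.
    chromosome_weights: chromosome lengths (or gene counts) in the same
    order; expected counts are assumed proportional to these weights.
    Returns (statistic, expected counts).
    """
    total_genes = sum(observed_counts)
    total_weight = sum(chromosome_weights)
    expected = [total_genes * w / total_weight for w in chromosome_weights]
    statistic = sum(
        (o - e) ** 2 / e for o, e in zip(observed_counts, expected)
    )
    return statistic, expected


def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests that remain significant after Bonferroni correction,
    i.e. whose p-value is below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]
```

With one test per taxon and 239 taxa, the corrected threshold in this scheme would be 0.05 / 239, or roughly 2.1 × 10^-4.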
To assess whether the phylogenetic signal contained in mzl-USCOs is
sufficient to infer the phylogenetic relationships of the investigated
taxa, we used the extracted mzl-USCOs of the 239 species of Metazoa for
phylogenetic analyses. The inferred phylogenetic trees based on a
supermatrix of amino acid sequences (Fig. S2) were largely consistent
with the respective current state-of-the-art phylogenetic hypotheses
(e.g., Laumer et al., 2019; Irisarri et al., 2017; Esselstyn et al.,
2017). Discrepancies occurred in a few rapid radiations. For example, in
the USCO-derived phylogenies of Neoaves we found hummingbirds to be more
closely related to passerines than to falcons and parrots, contradicting
results from phylogenomic studies of Jarvis et al. (2014) and Prum et
al. (2015). Such discrepancies were also found in multi-species
coalescent-based trees inferred from amino acid data (Fig. S4),
although these trees had low support values overall. Both supermatrix-
and coalescent-based phylogenetic inferences based on nucleotide
sequence data using codon positions 1 and 2 (Fig. S3, S5) resulted in
some highly questionable phylogenetic estimates, such as a non-monophyly
of Arthropoda.