Distribution and potential linkage patterns of mzl-USCOs
This study is the first comparative analysis of the physical distribution of mzl-USCOs in the genomes of a wide range of animal taxa. We did not find mzl-USCOs to exhibit a noteworthy tendency toward physical linkage when compared to randomly chosen protein-coding genes. Physical distances between USCO genes were generally much larger than the average distances across which loci can be assumed to be linked over evolutionary timescales (<1000 bp; Springer & Gatesy, 2016). The resulting average extent of linkage of loci located on the same chromosome is thus likely negligible and cannot be a priori assumed to violate assumptions of multispecies coalescent analyses, irrespective of whether the method is used for phylogenetic reconstruction or species delimitation. Although there was considerable variation across taxa, we found neighboring pairs of mzl-USCOs to be on average located somewhat closer together than pairs obtained by randomly choosing the same number of annotated protein-coding genes. A possible explanation for this result is that mzl-USCOs have a slight tendency to cluster in genomic regions that are under selection to remain single-copy.
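The comparison of neighboring distances described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function names, the use of gene start coordinates as positions, and the number of random draws are assumptions made for illustration only.

```python
import random
from statistics import mean


def neighbor_distances(positions):
    """Distances (bp) between consecutive loci on one chromosome.

    `positions` are assumed to be gene start coordinates (hypothetical input).
    """
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def mean_neighbor_distance(loci_by_chrom):
    """Mean neighbor distance pooled over all chromosomes.

    Chromosomes with fewer than two loci contribute no distances.
    """
    dists = [d for pos in loci_by_chrom.values() for d in neighbor_distances(pos)]
    return mean(dists)


def random_baseline(genes_by_chrom, n_loci, n_draws=1000, seed=0):
    """Mean neighbor distances for random gene sets of size `n_loci`.

    Each draw samples `n_loci` protein-coding genes without replacement
    from the whole genome and recomputes the pooled mean distance.
    """
    rng = random.Random(seed)
    all_genes = [(c, p) for c, ps in genes_by_chrom.items() for p in ps]
    means = []
    for _ in range(n_draws):
        by_chrom = {}
        for chrom, pos in rng.sample(all_genes, n_loci):
            by_chrom.setdefault(chrom, []).append(pos)
        dists = [d for pos in by_chrom.values() for d in neighbor_distances(pos)]
        if dists:  # skip draws in which no chromosome holds two loci
            means.append(mean(dists))
    return means
```

The observed value returned by `mean_neighbor_distance` for the USCO set can then be placed within the distribution returned by `random_baseline`, for example by reporting the fraction of random draws with a smaller mean distance.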
Mzl-USCOs were found to be rather evenly distributed over the chromosomes and do not cluster on particular chromosomes, as indicated by high values of adjusted evenness of the USCO distribution. However, taxa with chromosomes of unequal length tended to have an unequal distribution of mzl-USCOs. This was demonstrated by the positive and significant correlation of the evenness of chromosome length and of the protein-coding gene distribution with that of the USCO distribution. As expected, longer chromosomes, and especially chromosomes with relatively more protein-coding genes than others, also contain more mzl-USCOs. Chi-square tests showed, however, that this relationship is not necessarily linear. In nematodes, for example, the correlation of the number of mzl-USCOs with that of protein-coding genes was negative, although this was based on few chromosomes of rather similar length. In particular, the deviation of USCO number from that expected from chromosome length tended to be higher in birds, which also have highly unequal chromosome sizes within their genomes. This deviation is probably due to the high gene density of short chromosomes (microchromosomes; e.g., International Chicken Genome Sequencing Consortium, 2004), which are particularly common in birds but are also found in some other vertebrates (Waters et al., 2021). Significant deviations from the distribution of protein-coding genes in general are probably caused by taxon-specific groupings of mzl-USCOs on certain chromosomes. Such deviations do not seem to be conserved across major lineages, a pattern that is consistent with our observation that groupings of mzl-USCOs on the same chromosome are in most cases not phylogenetically conserved according to the current sampling of taxa. As some lineages were poorly covered by these analyses, it is nevertheless difficult to make accurate statements about this for metazoans in general.
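The two quantities discussed above, evenness across chromosomes and deviation from an expected distribution, can be sketched as follows. The adjustment applied to evenness in this study is not restated in this section, so unadjusted Pielou's evenness is used here as a stand-in; the function names and the toy counts are hypothetical.

```python
import math


def pielou_evenness(counts):
    """Pielou's evenness J = H / ln(k) for per-chromosome counts.

    J is 1 when all k chromosomes hold the same number of loci and
    approaches 0 when the loci are concentrated on one chromosome.
    This is the unadjusted index (an assumption, see lead-in).
    """
    total = sum(counts)
    k = len(counts)
    if k < 2 or total == 0:
        raise ValueError("need at least two chromosomes and one locus")
    shannon = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return shannon / math.log(k)


def chi_square_statistic(observed, weights):
    """Chi-square goodness-of-fit statistic and degrees of freedom.

    Expected counts are proportional to `weights`, e.g. chromosome
    lengths or protein-coding gene counts (all weights must be > 0).
    """
    total_obs = sum(observed)
    total_w = sum(weights)
    expected = [total_obs * w / total_w for w in weights]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1


# Toy values, invented for illustration only
usco_counts = [120, 95, 60, 25]          # mzl-USCOs per chromosome
gene_counts = [2400, 1800, 1300, 500]    # protein-coding genes per chromosome
evenness = pielou_evenness(usco_counts)
stat, df = chi_square_statistic(usco_counts, gene_counts)
```

A p-value would additionally require the chi-square distribution with `df` degrees of freedom, which is left out here to keep the sketch free of external dependencies.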
Intra-locus recombination is known to bias coalescent-based phylogenomic analyses (Gatesy & Springer, 2014; Edwards et al., 2016; Springer & Gatesy, 2018). Among eukaryotes, the genome-wide recombination rate is known to vary over at least one order of magnitude (Stapley et al., 2017). Intraspecific recombination rates are also known to vary between the sexes and across the genome, with recombination hot spots in which most crossovers occur (Jeffreys et al., 2001; Kauppi et al., 2004; Niehuis et al., 2010). Recombination hot spots have been studied in a variety of species, including fruit flies (Chan et al., 2012), crickets (Blankers et al., 2018), birds (Kawakami et al., 2017), and mammals (Jeffreys et al., 2001; Kauppi et al., 2004; Arnheim et al., 2007; Penalba & Wolf, 2020). In humans, recombination hot spots are regions of 1–2 kb that are separated from each other by larger regions (50–100 kb) with lower recombination activity (Myers et al., 2005; Baudat et al., 2010). Simulation studies have shown that species tree estimation is robust to recombination even if the amount of recombination exceeds that found in extant organisms (Lanier & Knowles, 2012; Zhu et al., 2022). However, these studies used a model of constant recombination rates across the genome (instead of a model of recombination hot spots), which might not properly reflect the situation in a given genome. We therefore expect that data partitioning and its implementation within models of species inference using the multispecies coalescent will remain a hot topic in the future, as will other parameters in species delimitation approaches, e.g., effective population size, whose fluctuation is known to impact species delimitation analyses (Ahrens et al., 2016).
The distribution of distances between USCO genes reported here exhibited lineage-specific patterns (Fig. 3; Figs. S6, S7), and some lineages showed extraordinary variation. This lineage-specific variation likely reflects peculiarities in the genomic architecture of different higher taxa, but a closer investigation of these phenomena is beyond the scope of this study.