Case studies: efficiency of USCO extraction
Our results confirmed that mzl-USCOs can easily be retrieved and
extracted from published WGS datasets in sufficient number to be useful
to infer reliable phylogenies on all systematic levels, from subspecies
to phylum. We found the quantity of recovered mzl-USCOs as well as the
number of retrieved alignment positions to be similar to those obtained
with approaches in which USCOs were obtained by DNA target enrichment
(Table 2; see also Dietz et al., 2023). Similar to the results when
assembling raw reads from hybrid enrichment libraries (Dietz et al.,
2023), results of gaining USCOs from WGS data were heavily influenced by
the data extraction method used and the quality of the genomic data (see
also Dietz et al. 2023). These factors might pose limits to the
possibilities of how genomic USCO sequence data can be generated and
combined in future systematic research (see below).
Independently from how Metazoa-level USCO markers were obtained, either
through target enrichment (Dietz et al., 2023) or by extraction from WGS
raw reads (this study), it is crucial for understanding the robustness
and ability to standardize species inference with Metazoa-level USCO
markers to clarify if and how differently generated mzl-USCOs can be
simultaneously analyzed. This knowledge is important to evaluate the
sustainability of the marker system, particularly for the case that a
marker-specific database is to be created. In this context, orthology
assessment is crucial, as it is used to identify and define the
fundamental entities that subsequent tools rely on. Due to the
increasing quality and number of available published genome assemblies,
the data available from successive versions of OrthoDB is continuously
evolving towards a more comprehensive taxon coverage. Consequently, the
set of genes defined as mzl-USCOs (i.e., those present in single-copy in
at least 90% of known genomes of a given group; in our case, Metazoa)
is expected to change, but also to converge over time. This change over
time is reflected by the number of single-copy genes present in at least
90% of the genomes of Metazoa in different versions of OrthoDB: it has
changed from 978 in OrthoDB v. 9, to 954 in OrthoDB v. 10, to 1,268 in
OrthoDB v. 11 (Kuznetsov et al., 2023). This change is likely driven by
the steady addition of genomes considered for orthology prediction in
OrthoDB (330 in v. 9, 448 in v. 10; 812 in v. 11). While the turnover in
which genes are considered mzl-USCOs is currently still high, we expect
changes regarding the set of genes flagged as mzl-USCOs becoming smaller
with increased knowledge of the evolution of gene families.
Although genes identified as mzl-USCOs were not entirely identical
between the different versions of OrthoDB used here, we observed that
these changes had minimal impact on the overall size of USCO nucleotide
alignments, on alignment completeness, on the topology of inferred
phylogenetic trees, on nucleotide sequence variation (SNP) clustering,
and on species delimitation results, as long as a sufficiently large
number of orthologs are involved and if incomplete alignment sections
are removed.
Comparing the results of BUSCO-based to the Orthograph-based extraction
of USCOs, we find that the former yields generally longer nucleotide
sequences per gene. This is probably because, besides the conserved
region of a gene identified by the HMMs, BUSCO also includes flanking
regions with a length of 5-20 kpb (Simão et al. 2015). Within these
regions, the full gene is then determined by using the gene prediction
of the Augustus pipeline (Stanke & Waack, 2003; Stanke et al., 2006) to
extend the gene model beyond the conserved region specified by the HMM.
In contrast, Orthograph searches only for the conserved region for which
the HMM was created. This effect is particularly important since the
mzl-USCO HMMs contain only the gene region conserved across all Metazoa.
Full length coding regions could also be extracted with Orthograph by
using HMMs created specific tor the individual groups, although these
would then be not necessarily homologous across all groups anymore.
The additional nucleotide sequence information of data based on BUSCO
software seems to impact the phylogenetic signal neither positively nor
negatively. Consequently, the results of phylogenetic tree
reconstruction, SNP clustering, and species delimitation analysis hardly
differ between BUSCO- and Orthograph-based extraction approaches.
However, our results show that, even with extensive filtering and manual
curation, BUSCO- and Orthograph-derived USCO data should not be analyzed
together. The resulting phylogenies obtained with mixed data may be
severely misleading. Data pruning improves the results, but large
discrepancies may remain, especially when using approaches that analyze
concatenated alignments rather than coalescent-based approaches that
rely on gene trees for phylogenetic inference. Therefore, we recommend
extracting mzl-USCOs in one consistent way. The underlying problem is
that alignments of systematically different sequences pose a problem to
currently used alignment programs. The ongoing change in sets of genes
being classified as USCOs could become a problem in terms of data
overlap for the sustainability of the marker system, as datasets
generated with older and newer OrthoDB versions might not be fully
comparable. Here the overlap between Metazoa-level orthologs from
OrthoDB v. 9 and OrthoDB v. 10 was relatively moderate (only 580 out of
978 resp. 954 genes; 59% resp. 61%). This has an impact especially on
the possibility to combine future USCO data with already available
mzl-USCOs generated with DNA target enrichment using baits which are
based on earlier and/or different OrthoDB versions. However, we expect
that future versions of OrthoDB will have an increasing amount of
overlap of mzl-USCOs between versions, since genes that are present and
single copy in 90% of the genomes should converge as soon as all
taxonomic groups are covered evenly in OrthoDB.