3.1 | Genome assembly, quality evaluation, and annotation
Pse. libanotica had an estimated genome size of 3,273.28 Mb (1C)
and 3,048.18 Mb based on flow cytometry and k-mer statistics,
respectively (Figure S1). By integrating ~191 Gb (64×) of Illumina
paired-end short reads, ~440 Gb (147×) of Nanopore sequencing data,
and ~330 Gb (110×) of high-throughput chromosome conformation capture
(Hi-C) data, we generated a chromosome-level assembly of
Pse. libanotica. The
assembly comprised 2.99 Gb of sequence, with a contig N50 of
920.96 kb and a super-scaffold N50 of 380.09 Mb, accounting for 96.45%
of the estimated genome size, with a heterozygosity of 1.34% (Table 1;
Tables S1, S2; Figures S2, S3). Of the 2.99 Gb of scaffold sequence,
2.75 Gb (91.97%) was anchored to seven super-scaffolds (chromosomes)
using Hi-C data (Figure 1; Table S3).
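The N50 and depth figures above can be illustrated with a minimal Python sketch. It is not the pipeline used for the assembly: the contig lengths are toy values, and the use of 2.99 Gb as the denominator for sequencing depth is an assumption (the text does not state which genome size was used), although it is consistent with the reported 64×, 147×, and 110×.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def depth(data_gb, genome_gb):
    """Approximate sequencing depth as sequenced bases / genome size."""
    return data_gb / genome_gb


# Toy contig lengths (hypothetical, not taken from the assembly).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_lengths)

# Depths implied by the reported data volumes, ASSUMING a 2.99 Gb genome.
depths = {
    name: round(depth(gb, 2.99))
    for name, gb in [("Illumina", 191), ("Nanopore", 440), ("Hi-C", 330)]
}

# Share of the 2.99 Gb assembly anchored to the seven chromosomes.
anchored_pct = round(100 * 2.75 / 2.99, 2)
```

With the toy lengths, the two longest contigs (80 and 70) already reach half of the 300-unit total, so the N50 is 70. The anchored share evaluates to 91.97%, matching the reported value.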
The integrity and base accuracy of the assembled Pse. libanotica
genome were verified using CEGMA (Parra et al., 2007) and BUSCO (Simão
et al., 2015). CEGMA showed that the assembly completely covered
228 (91.94%) of the 248 core genes and partially covered 11 core
genes; fewer than 4% of the core genes were not detected. BUSCO
showed that 95.2% of the 1,440 single-copy genes had homologous
sequences in Triticeae species (Table S4). The draft assembly was
further evaluated by mapping high-quality short reads to the assembled
genome. The mapping rate was 98.95%, with an average sequencing depth
of 58.95× (Table S5). In Pse. libanotica, 151,872
expressed sequence tag (EST) sequences were mapped to the genome with
>95% identity, of which 132,240 (87.10%) were aligned to
the reference genome with >87% coverage (Table S6).
Collectively, these data indicate high coverage of the assembled St
genome.
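The CEGMA percentages quoted above follow from simple counts, as the short Python sketch below shows. It assumes that the 11 partially covered genes are counted separately from the 228 completely covered ones; the text implies this but does not state it, and the variable names are illustrative only.

```python
def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to 2 decimals."""
    return round(100 * part / whole, 2)


cegma_total = 248      # core eukaryotic genes assessed by CEGMA
cegma_complete = 228   # completely covered in the assembly
cegma_partial = 11     # partially covered (ASSUMED disjoint from complete)

# Genes that are neither completely nor partially covered.
cegma_missing = cegma_total - cegma_complete - cegma_partial

complete_pct = pct(cegma_complete, cegma_total)
missing_pct = pct(cegma_missing, cegma_total)
```

Under this assumption, nine core genes are undetected, which corresponds to 3.63% and agrees with the statement that fewer than 4% were missing.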
A total of 46,369 protein-coding genes were identified, of which 91.4%
had functional annotations (Figure 1B; Tables S7, S8). We also identified
1,483 transfer RNAs, 18,438 miRNAs, 1,427 small nuclear RNAs, and 473
ribosomal RNAs (Table S9). Repeat sequences comprised
71.62% of the assembled genome, with transposable elements (TEs) being
the major component (Figure 1B; Table 2; Table S10). Long terminal
repeat (LTR) retrotransposons were the most abundant repeat type,
whereas the other retrotransposons, short interspersed nuclear
elements (SINEs) and long interspersed nuclear elements (LINEs),
accounted for the lowest proportion of the final assembly. In the St
genome, 0.31% of repeat sequences could not be annotated (Table 2).
Moreover, we predicted the centromere positions of the Pse. libanotica
chromosomes: 2St, 3St, 4St, 6St, and 7St were metacentric, whereas
1St and 5St were submetacentric
(Figure 1B; Table S11).
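The metacentric and submetacentric labels can be related to centromere position through the arm ratio (long arm / short arm). The Python sketch below uses a commonly used arm-ratio convention with thresholds of 1.7, 3.0, and 7.0; this is an assumption, since the criteria applied here are not stated in the text, and the example coordinates are hypothetical rather than values from Table S11.

```python
def classify_chromosome(centromere_mb, length_mb):
    """Classify a chromosome by arm ratio (long arm / short arm).

    Thresholds (1.7, 3.0, 7.0) follow a common convention and are an
    ASSUMPTION here, not the criteria reported for this assembly.
    """
    if not 0 < centromere_mb < length_mb:
        raise ValueError("centromere must lie strictly inside the chromosome")
    short_arm = min(centromere_mb, length_mb - centromere_mb)
    long_arm = length_mb - short_arm
    ratio = long_arm / short_arm
    if ratio <= 1.7:
        return "metacentric"
    if ratio <= 3.0:
        return "submetacentric"
    if ratio <= 7.0:
        return "subtelocentric"
    return "acrocentric"


# Hypothetical centromere positions on a 400 Mb chromosome.
example_meta = classify_chromosome(200, 400)  # arm ratio 1.0
example_sub = classify_chromosome(120, 400)   # arm ratio 280 / 120
```

A centromere at the midpoint gives an arm ratio of 1.0 (metacentric), whereas a centromere at 120 Mb of 400 Mb gives a ratio of about 2.3 (submetacentric).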