3.1 | Genome assembly, quality evaluation, and annotation
Pse. libanotica had an estimated genome size of 3,273.28 Mb (1C) based on flow cytometry and 3,048.18 Mb based on k-mer statistics (Figure S1). By integrating ~191 Gb (64×) of Illumina paired-end short reads, ~440 Gb (147×) of Nanopore sequencing data, and ~330 Gb (110×) of high-throughput chromosome conformation capture (Hi-C) data, we generated a chromosome-level assembly of Pse. libanotica. The assembly comprised 2.99 Gb, with a contig N50 of 920.96 Kb and a super-scaffold N50 of 380.09 Mb, accounting for 96.45% of the estimated genome size, with a heterozygosity of 1.34% (Table 1; Table S1, S2; Figure S2, S3). Of the 2.99 Gb of scaffold sequences, 2.75 Gb (91.97%) was anchored to seven super-scaffolds (chromosomes) using Hi-C data (Figure 1; Table S3).
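For readers less familiar with the summary statistics quoted above, the following is a minimal Python sketch of how an N50 value, a sequencing depth, and an anchored fraction can be computed. It is illustrative only and is not the pipeline used in this study. The function names, the toy contig lengths, and the choice of 2.99 Gb as the denominator for depth are assumptions; the text does not state which genome size estimate the reported depths were calculated against.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one sequence length")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that pushes the running sum to >= 50%.
        if 2 * running >= total:
            return length


def sequencing_depth(data_gb, genome_gb):
    """Approximate depth as total sequenced bases divided by genome size."""
    return data_gb / genome_gb


# Toy contig lengths (hypothetical, not from the assembly).
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_contigs)

# Assumption: depth is computed against the 2.99 Gb assembly size.
GENOME_GB = 2.99
data_gb = {"Illumina": 191, "Nanopore": 440, "Hi-C": 330}
depths = {name: round(sequencing_depth(gb, GENOME_GB))
          for name, gb in data_gb.items()}

# Fraction of the 2.99 Gb assembly anchored to the seven super-scaffolds.
anchored_pct = round(100 * 2.75 / 2.99, 2)

print(toy_n50, depths, anchored_pct)
```

Under these assumptions the rounded depths and the anchored percentage agree with the values reported in the text (64×, 147×, 110×, and 91.97%).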
The integrity and base accuracy of the assembled Pse. libanotica genome were verified using CEGMA (Parra et al., 2007) and BUSCO (Simão et al., 2015). CEGMA showed that the assembled genome completely covered 228 (91.94%) of the 248 core genes and partially covered 11 core genes. Less than 4% of the core genes were not detected. BUSCO indicated that 95.2% of the 1,440 single-copy genes were homologous sequences in Triticeae species (Table S4). The draft assembly was further evaluated by mapping high-quality short reads to the assembled genome. The mapping rate was 98.95%, with 58.95% of the average sequencing depth (Table S5). In Pse. libanotica, 151,872 expressed sequence tag (EST) sequences were mapped to the genome with >95% identity, of which 132,240 (87.10%) were aligned to the reference genome with >87% coverage (Table S6). Collectively, these data demonstrate the high coverage of the assembled St genome.
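The CEGMA proportions above follow directly from the reported gene counts. The short Python sketch below reproduces that arithmetic; the variable and function names are illustrative assumptions, and the count of undetected genes is derived here by subtraction rather than taken from the source.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100.0 * part / whole, 2)


CORE_GENES = 248   # CEGMA core gene set size reported in the text
COMPLETE = 228     # completely covered core genes
PARTIAL = 11       # partially covered core genes

# Derived by subtraction (assumption: every core gene falls in one class).
missing = CORE_GENES - COMPLETE - PARTIAL

complete_pct = percent(COMPLETE, CORE_GENES)
missing_pct = percent(missing, CORE_GENES)

print(complete_pct, missing, missing_pct)
```

The derived missing fraction is consistent with the statement that less than 4% of the core genes were not detected.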
A total of 46,369 protein-coding genes were identified, of which 91.4% had functional annotations (Figure 1B; Table S7, S8). We also identified 1,483 transfer RNAs, 18,438 miRNAs, 1,427 small nuclear RNAs, and 473 ribosomal RNAs (Table S9). Repeat sequences comprised 71.62% of the assembled genome, with transposable elements (TEs) being the major component (Figure 1B; Table 2; Table S10). Long terminal repeats (LTRs) were the most abundant repeat type, whereas the other retrotransposons, short interspersed nuclear elements (SINEs) and long interspersed nuclear elements (LINEs), had the lowest proportions in the final assembly. In the St genome, 0.31% of repeat sequences could not be annotated (Table 2). Moreover, we predicted the centromere positions of the Pse. libanotica chromosomes: 2St, 3St, 4St, 6St, and 7St were metacentric, whereas 1St and 5St were submetacentric (Figure 1B; Table S11).