Results
Genome sequencing and de novo assembling before SVs calling
To determine the population identity of our focal sample of a Chinese rhesus macaque (CR2), we obtained 292,478,835,000 base pairs of Illumina short-read data (95.27x). Subsequently, we performed population genetic analyses based on totally 27 samples, including the newly-sequenced and publicly available NGS data for Chinese and Indian rhesus macaques. We confirmed the subspecies ancestry of CR2 using the method the principal component analysis (PCA). We found that CR2 was genetically within the cluster of the Chinese rhesus population instead of the Indian population (Figure 1A). This result supported CR2 as an acceptable representative of Chinese rhesus macaque subspecies for further sequencing and subsequent analyses.
For the purpose of SVs calling using the “assembly-based” method [53], we initially conducted de novo assembly of CR2 based on both NGS and long-read data. In total, we obtained 98,513,621,108 base pairs of Nanopore long-reads from the ONT-PromethION platform, with a read N50 length 49.91 Kb and average read length 26 Kb. Based on the estimation of GCE v1.0.2 with kmer 17 [54], the genome size of Chinese rhesus macaque is ~ 3.12 Gb. Thus, we achieved ~32x genome coverage of long reads. A hybrid and cost-effective methodology was applied using these long reads and ~95x short reads to conduct de novo assembling. Previous reports have revealed that a hybrid assembly would need over 50x short-read coverage and over 30x long-read coverage [55, 56]. Thus, the data volume in this study is sufficient for de novo assembly. The major steps of genome assembling included reads cleaning (NECAT v0.0.1 and fastp), de novo assembling (NECAT v0.0.1 and NextDenovo v2.4.0), reference-guided chromosome assignment (RagTag v2.1), and long and short reads correction (Pilon v1.24, Racon v1.4.3, Inspector v1.0.2, and Tigmint v1.2.6).
We found 14.87 times more contigs with NECAT than with NextDenovo (3376 vs. 227), probably because NECAT captured more smaller contigs than NextDenovo. In addition, the total length of contigs from NECAT is greater than that from NextDenovo (3.13Gb vs. 2.85Gb). This suggests a higher yield but more fragmented genome from NECAT than from NextDenovo. In addition, NextDenovo achieved a higher contig N50 than the two previously reported references (65.99Mb vs. 46.61Mb and 8.19Mb). Thus, we jointly utilized long contigs from NextDenovo and shorter contigs from NECAT following the RagTag “scaffold” function and reduced the number of contigs (2182) while maintaining genome completeness (3.13 Gb).
Because of no available optical mapping or Hi-C data for our sample, we took a reference-guided method (rheMacS) to assign, order, and orient some contigs into the chromosomes [34]. We reduced the number of contigs into 1,479, lower than the previous two references (Table 1). After applying both long and short reads correction, we obtained a chromosome-level genome reference for CR2 (Table 1 and Supplementary figure 1). The full-length nanopore RNAseq data, which were sequenced based on pooled tissues of heart, liver, whole brain, intestine, testis, muscle, and pancreas, were used for gene annotation.
Previously, two genomic references for the Chinese (rheMacS) and Indian rhesus macaque subspecies (Mmul_10 or mm10) were based on mainly the PacBio long reads [34]. In this study, the long reads of CR2 were obtained from Nanopore technology (R9.4), which offered an advantage of methylation information as well as the long reads. The comparison among summary statistics of three references revealed that CR2 has a slightly improved N50 values and lower counts for contigs and scaffolds, relative to both rheMacS and Mmul_10 (Table 1). The BUSCO evaluation [57], which is based on 9226 near-universal single-copy genes in mammalian species (odb10), revealed that the compete BUSCOs is 95.9% (Figure 1B), slightly higher than the previously reported percentage of a Chinese rhesus macaque (rheMacS) [34].
Genes missing in the mm10 assembly were uncovered
Mouse, rat, and macaque are among the most widely used animal models for human medical issues. Despite closer evolutionary relationship between human and macaque, mouse and rat are oftentimes better choice for human transgenic studies due to not only the operation convenience of mouse and rat but also the missing of orthologous genes in current macaque genome. We tried to understand how many human orthologous genes are present in mouse but absent in macaque. Based on Ensembl annotation (v105) of comparative species, we retrieved ”one to one ” orthologous genes between human, mouse, rat, and macaque and obtained human orthologous genes present in rodent species (mouse and rat) but absent in macaque. After mapping these human genes back to mm10 genome reference, we obtained 89 genes without reliable mapping coordinates of orthologous genes in macaque, using even the most relaxed threshold (Evalue 0.001 in TBLASTN). These results suggest that these genes may be either bona fide evolutionarily lost genes or hidden genes due to annotation error in genome reference of mm10.
To identify the potential hidden genes, we firstly produced a reference-guided assembly of transcripts based on Nanopore direct RNA sequencing data from a pool of seven tissues and then used these transcripts as the reference to retrieve the 89 potential hidden genes. Notably, 75.28% of these genes (67 out 89) were mapped to transcripts with reliable signals (E-value < 0.001) and 65 genes showed E-value less than 10-10. The coding sequences of these genes showed a slightly but not significantly higher median value of GC contents (51.93%) relative to genomic background level (51.45%). The “absence” of these “hidden genes” could be due to the previously neglected complex transcripts using only NGS data [58], suggesting the importance of long-read RNAseq data in decoding a more complete spectrum of transcriptome.