Results
Genome
sequencing and de novo assembling before SVs calling
To determine the population identity of our focal sample of a Chinese
rhesus macaque (CR2), we obtained 292,478,835,000 base pairs of Illumina
short-read data (95.27x). Subsequently, we performed population genetic
analyses based on totally 27 samples, including the newly-sequenced and
publicly available NGS data for Chinese and Indian rhesus macaques. We
confirmed the subspecies ancestry of CR2 using the method the principal
component analysis (PCA). We found that CR2 was genetically within the
cluster of the Chinese rhesus population instead of the Indian
population (Figure 1A). This result supported CR2 as an acceptable
representative of Chinese rhesus macaque subspecies for further
sequencing and subsequent analyses.
For the purpose of SVs calling using the “assembly-based” method
[53], we initially conducted de novo assembly of CR2 based on
both NGS and long-read data. In total, we obtained 98,513,621,108 base
pairs of Nanopore long-reads from the ONT-PromethION platform, with a
read N50 length 49.91 Kb and average read length 26 Kb. Based on the
estimation of
GCE
v1.0.2 with kmer 17 [54], the genome size of Chinese rhesus macaque
is ~ 3.12 Gb. Thus, we achieved ~32x
genome coverage of long reads. A hybrid and cost-effective methodology
was applied using these long reads and ~95x short reads
to conduct de novo assembling. Previous reports have revealed
that a hybrid assembly would need over 50x short-read coverage and over
30x long-read coverage [55, 56]. Thus, the data volume in this study
is sufficient for de novo assembly. The major steps of genome
assembling included reads cleaning (NECAT v0.0.1 and fastp), de
novo assembling (NECAT v0.0.1 and NextDenovo v2.4.0), reference-guided
chromosome assignment (RagTag v2.1), and long and short reads correction
(Pilon v1.24, Racon v1.4.3, Inspector v1.0.2, and Tigmint v1.2.6).
We found 14.87 times more contigs with NECAT than with NextDenovo (3376
vs. 227), probably because NECAT captured more smaller contigs than
NextDenovo. In addition, the total length of contigs from NECAT is
greater than that from NextDenovo (3.13Gb vs. 2.85Gb). This suggests a
higher yield but more fragmented genome from NECAT than from NextDenovo.
In addition, NextDenovo achieved a higher contig N50 than the two
previously reported references (65.99Mb vs. 46.61Mb and 8.19Mb). Thus,
we jointly utilized long contigs from NextDenovo and shorter contigs
from NECAT following the RagTag “scaffold” function and reduced the
number of contigs (2182) while maintaining genome completeness (3.13
Gb).
Because of no available optical mapping or Hi-C data for our sample, we
took a reference-guided method (rheMacS) to assign, order, and orient
some contigs into the chromosomes [34]. We reduced the number of
contigs into 1,479, lower than the previous two references (Table 1).
After applying both long and short reads correction, we obtained a
chromosome-level genome reference for CR2 (Table 1 and Supplementary
figure 1). The full-length nanopore RNAseq data, which were sequenced
based on pooled tissues of heart, liver, whole brain, intestine, testis,
muscle, and pancreas, were used for gene annotation.
Previously, two genomic references for the Chinese (rheMacS) and Indian
rhesus macaque subspecies (Mmul_10 or mm10) were based on mainly the
PacBio long reads [34]. In this study, the long reads of CR2 were
obtained from Nanopore technology (R9.4), which offered an advantage of
methylation information as well as the long reads. The comparison among
summary statistics of three references revealed that CR2 has a slightly
improved N50 values and lower counts for contigs and scaffolds, relative
to both rheMacS and Mmul_10 (Table 1). The BUSCO evaluation [57],
which is based on 9226 near-universal single-copy genes in mammalian
species (odb10), revealed that the compete BUSCOs is 95.9% (Figure 1B),
slightly higher than the previously reported percentage of a Chinese
rhesus macaque (rheMacS) [34].
Genes
missing in the mm10 assembly were uncovered
Mouse, rat, and macaque are among the most widely used animal models for
human medical issues. Despite closer evolutionary relationship between
human and macaque, mouse and rat are oftentimes better choice for human
transgenic studies due to not only the operation convenience of mouse
and rat but also the missing of orthologous genes in current macaque
genome. We tried to understand how many human orthologous genes are
present in mouse but absent in macaque. Based on Ensembl annotation
(v105) of comparative species, we retrieved ”one to one ” orthologous
genes between human, mouse, rat, and macaque and obtained human
orthologous genes present in rodent species (mouse and rat) but absent
in macaque. After mapping these human genes back to mm10 genome
reference, we obtained 89 genes without reliable mapping coordinates of
orthologous genes in macaque, using even the most relaxed threshold
(Evalue 0.001 in TBLASTN). These results suggest that these genes may be
either bona fide evolutionarily lost genes or hidden genes due to
annotation error in genome reference of mm10.
To identify the potential hidden genes, we firstly produced a
reference-guided assembly of transcripts based on Nanopore direct RNA
sequencing data from a pool of seven tissues and then used these
transcripts as the reference to retrieve the 89 potential hidden genes.
Notably, 75.28% of these genes (67 out 89) were mapped to transcripts
with reliable signals (E-value < 0.001) and 65 genes showed
E-value less than 10-10. The coding sequences of these
genes showed a slightly but not significantly higher median value of GC
contents (51.93%) relative to genomic background level (51.45%). The
“absence” of these “hidden genes” could be due to the previously
neglected complex transcripts using only NGS data [58], suggesting
the importance of long-read RNAseq data in decoding a more complete
spectrum of transcriptome.