Materials and methods
DNA extraction, sequencing, and genome assembly
Total genomic DNA was extracted from a blood sample of an adult male
Chinese rhesus macaque. The study protocol and data analyses were
formally approved by the Ethics Committee of West China Hospital
(Registration number: 20220211006). Long-read and short-read data
were produced using the Oxford Nanopore PromethION and Illumina
platforms, respectively. Basecalling of the Nanopore output was
performed using Guppy [102], and only reads with mean quality scores >7
were retained.
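The read-level quality filter described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes Phred+33 encoded FASTQ quality strings and assumes that the mean quality score is derived from the mean per-base error probability, which may differ in detail from what Guppy reports. The function names `mean_qscore` and `keep_read` are hypothetical.

```python
import math


def mean_qscore(quality_string, offset=33):
    """Mean read quality from a FASTQ quality string.

    Assumption: per-base Phred scores are converted to error
    probabilities, averaged, and converted back to a Phred score.
    """
    if not quality_string:
        return 0.0
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def keep_read(quality_string, min_q=7.0):
    """Retain a read only if its mean quality score exceeds min_q."""
    return mean_qscore(quality_string) > min_q
```

Averaging error probabilities rather than raw Phred values penalizes reads with a few very poor bases more strongly, which is why the two conventions can disagree near the threshold.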
To improve the accuracy and continuity of the genome assembly, we
conducted de novo assembly using a three-step approach. Step one:
long-read cleaning and de novo assembly. We conducted the first round
of de novo assembly with NECAT v0.0.1 [103], which corrects errors in
noisy Nanopore long reads before de novo assembly. The raw reads were
corrected and then de novo assembled with default parameters except for
“MIN_READ_LENGTH=1000”. Additionally, NextDenovo v2.4.0 was used for
de novo assembly based on the reads cleaned by NECAT. NextDenovo
yielded a longer contig N50 than NECAT (65.99 Mb vs. 6.44 Mb). Thus,
the NextDenovo contigs longer than 20 Mb were used in step two. Step two:
reference-guided chromosome assignment. To anchor the contigs, we used
a publicly available reference genome for the Chinese rhesus macaque
(rheMacS) and the long contigs from NextDenovo as backbones to order
and orient the NECAT raw contigs [34], with default parameters of
RagTag v2.1.0 [104]. Step three: SV-sensitive polishing.
We combined several tools with default parameters to polish errors
introduced by the long reads, including four rounds of racon v1.4.3
[105] with long reads and four rounds of pilon v1.24 [106] with short
reads. Finally, we corrected misassemblies with tools sensitive to
structural variations, including tigmint v1.2.6 [107] and Inspector
v1.0.2 [108]. The final assembled genome was aligned to the Indian
macaque reference (Mmul_10) with minimap2 [109] and visualized as a
dot plot with dotPlotly (https://github.com/tpoorten/dotPlotly).
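For readers unfamiliar with the contig N50 used to compare the NECAT and NextDenovo assemblies, the following sketch shows one standard way to compute it, together with a length filter of the kind used to select contigs longer than 20 Mb. The function names and the dictionary input format are hypothetical, and the sketch works on contig lengths only, not on sequence files.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def select_long_contigs(length_by_name, min_length=20_000_000):
    """Names of contigs strictly longer than min_length (default 20 Mb)."""
    return [name for name, length in length_by_name.items()
            if length > min_length]
```

For example, `n50([2, 3, 4, 5, 6])` evaluates to 5, because the two longest contigs (6 + 5 = 11) already cover at least half of the total length of 20.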
Genome completeness and continuity were further evaluated with BUSCO
[57] and QUAST [110], respectively. BUSCO assesses completeness as the
percentage of complete near-universal single-copy orthologs, whereas
QUAST reports N50, contig length distribution, and related statistics. The
genome was annotated with the long-read RNAseq data from a pool of
tissues (heart, liver, whole brain, intestine, testis, muscle, and
pancreas) using MAKER2 [111]. Candidate “hidden genes” were retrieved
using the BioMart tool from Ensembl (v105), considering only the
“one to one” orthologous genes among mouse, rat, human, and macaque.
Genes that were absent only from macaque were identified as candidate
“hidden genes”. These genes were then mapped back to the CR2 genome,
and only the genes with no mapping coordinates were defined as the real
“hidden genes”. These genes were recovered by mapping (BLAST
tools, E-value < 0.001) to the long-read RNAseq transcriptome
obtained from the assembly-guided mapping and de novo assembly
with tools including STAR v2.7.10a [112] and StringTie2 [113].
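The selection logic for “hidden genes” described above can be sketched with simple set operations. This is an illustration under stated assumptions rather than the authors' code: the per-species ortholog sets are assumed to be already exported from BioMart, the BLAST results are assumed to be in the standard 12-column tabular format (`-outfmt 6`, with the E-value in the 11th column), and the gene is assumed to be the query sequence. All function names are hypothetical.

```python
def candidate_hidden_genes(orthologs_by_species, target="macaque"):
    """Genes present in every other species but absent from the target."""
    others = [set(genes) for species, genes in orthologs_by_species.items()
              if species != target]
    if not others:
        return set()
    shared = set.intersection(*others)
    return shared - set(orthologs_by_species.get(target, ()))


def unmapped_genes(candidates, genes_with_coordinates):
    """Keep only candidates with no mapping coordinates on the assembly."""
    return set(candidates) - set(genes_with_coordinates)


def recovered_genes(hidden, blast_lines, max_evalue=1e-3):
    """Hidden genes with at least one BLAST hit below the E-value cutoff.

    Assumption: tab-separated rows in BLAST tabular format 6, where
    column 1 is the query ID and column 11 is the E-value.
    """
    found = set()
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or comment lines
        query, evalue = fields[0], float(fields[10])
        if query in hidden and evalue < max_evalue:
            found.add(query)
    return found
```

Keeping the three steps as separate functions mirrors the order in the text: ortholog comparison first, then the mapping check against the assembly, then the transcriptome search.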