Materials and methods
DNA extraction, sequencing, and genome assembly
Total genomic DNA was extracted from a blood sample of an adult male Chinese rhesus macaque. The study protocol and data analyses were formally approved by the Ethics Committee of West China Hospital (Registration number: 20220211006). Long-read and short-read data were produced on the Oxford Nanopore PromethION and Illumina platforms, respectively. Nanopore output was basecalled with Guppy [102], and only reads with a mean quality score >7 were retained.
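The read-level quality filter can be illustrated with a minimal Python sketch. It assumes Phred+33 encoded quality strings and assumes that the mean quality is obtained by averaging per-base error probabilities rather than raw Phred scores; the source does not state how the basecaller computes this value, and the function names are hypothetical.

```python
import math


def mean_qscore(quality_string, phred_offset=33):
    """Mean read quality, averaged in error-probability space (assumption)."""
    if not quality_string:
        return 0.0
    # Convert each quality character to its per-base error probability.
    probs = [10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality_string]
    mean_error = sum(probs) / len(probs)
    # Convert the mean error probability back to a Phred-scaled score.
    return -10 * math.log10(mean_error)


def passes_filter(quality_string, min_mean_q=7.0):
    """Keep a read only if its mean quality score exceeds the threshold."""
    return mean_qscore(quality_string) > min_mean_q
```

Averaging in probability space penalizes reads with a few very poor bases more strongly than an arithmetic mean of Phred scores would, so a read made of one Q20 base and one Q2 base fails the >7 cutoff under this sketch.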
To improve the accuracy and continuity of the genome assembly, we conducted de novo assembly using a three-step approach. Step one: long-read cleaning and de novo assembly. We conducted the first round of de novo assembly with NECAT v0.0.1 [103], which corrects errors in noisy Nanopore long reads before assembly. The raw reads were corrected and then assembled de novo with default parameters except for “MIN_READ_LENGTH=1000”. Additionally, NextDenovo v2.4.0 was used for de novo assembly based on the reads cleaned by NECAT. NextDenovo produced a longer contig N50 than NECAT (65.99 Mb vs. 6.44 Mb). Thus, the contigs longer than 20 Mb were used in step two. Step two: reference-guided chromosome assignment. To anchor the locations of contigs, we used a publicly available reference genome for the Chinese rhesus macaque (rheMacS) and the long contigs from NextDenovo as backbones to order and orient the NECAT raw contigs [34], with default parameters of RagTag v2.1.0 [104]. Step three: SV-sensitive polishing. We combined several tools with default parameters to further correct errors introduced by the long reads, including four rounds of racon v1.4.3 [105] with long reads and four rounds of pilon v1.24 [106] with short reads. Finally, we conducted misassembly correction with tools sensitive to structural variations, including tigmint v1.2.6 [107] and Inspector v1.0.2 [108]. The final assembled genome was mapped to the Indian macaque reference (Mmul_10) with minimap2 [109] and visualized as a dot plot with dotPlotly (https://github.com/tpoorten/dotPlotly).
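For readers unfamiliar with the contig N50 used to compare the two assemblers, the following Python sketch shows the standard definition together with a length cutoff like the 20 Mb threshold applied before step two. It is an illustration only, not the code used in the study; the function names and the dictionary input format are assumptions.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


def select_long_contigs(contig_lengths, min_length=20_000_000):
    """Keep contigs strictly longer than min_length (hypothetical helper)."""
    return {
        name: length
        for name, length in contig_lengths.items()
        if length > min_length
    }
```

For example, contigs of lengths 2, 3, 4, 5, and 6 sum to 20; the two longest (6 and 5) already cover half of that total, so the N50 is 5.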
Genome continuity and completeness were further evaluated with the BUSCO metric [57] and QUAST statistics [110]. BUSCO assesses completeness as the percentage of complete near-universal single-copy orthologs, while QUAST reports statistics such as N50 and the contig length distribution. The genome was annotated with long-read RNA-seq data from a pool of tissues (heart, liver, whole brain, intestine, testis, muscle, and pancreas) using MAKER2 [111]. The “hidden genes” were retrieved using the BioMart tool from Ensembl (v105), focusing only on the “one to one” orthologous genes between mouse, rat, human, and macaque. Genes that were absent only from macaque were identified as candidate “hidden genes”. These genes were then mapped back to the CR2 genome, and only the genes with no mapping coordinates were defined as the real “hidden genes”. These genes were recovered by mapping (BLAST tools, E-value < 0.001) to the long-read RNA-seq transcriptome obtained from assembly-guided mapping and de novo assembly with tools including STAR v2.7.10a [112] and StringTie2 [113].
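The “hidden gene” selection and recovery logic described above can be sketched in Python as set operations followed by an E-value filter. This is a minimal sketch under stated assumptions: gene identifiers are used directly as BLAST query IDs, the BLAST results are in the default 12-column tabular format (E-value in the eleventh column), and all function and parameter names are hypothetical rather than taken from the study.

```python
def candidate_hidden_genes(one_to_one_in_others, present_in_macaque, mapped_to_cr2):
    """Genes with one-to-one orthologs in the other species, absent from the
    macaque annotation, and without mapping coordinates on the CR2 genome."""
    absent_from_macaque = set(one_to_one_in_others) - set(present_in_macaque)
    return absent_from_macaque - set(mapped_to_cr2)


def recovered_genes(blast_lines, candidates, max_evalue=1e-3):
    """Candidates with at least one transcriptome hit below the E-value cutoff."""
    recovered = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a default tabular record
        query, evalue = fields[0], float(fields[10])
        if query in candidates and evalue < max_evalue:
            recovered.add(query)
    return recovered
```

A gene counts as recovered as soon as one hit passes the cutoff; stricter criteria such as minimum alignment length or identity are not mentioned in the text and are therefore not applied here.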