2.3 Genome assembly and quality assessment
We estimated the genome size and heterozygosity of the P. leopardus genome by k-mer analysis of the quality-filtered reads. K-mer frequencies were counted with Jellyfish (v2.2.10) (Marcais & Kingsford, 2011) using k = 17 and a maximum k-mer count of 10,000. The k-mer distribution was analyzed and plotted using GenomeScope (Vurture et al., 2017). The genome size was calculated with the formula G = N17-mer / D17-mer, where N17-mer is the total number of 17-mers and D17-mer is the depth at the peak of the 17-mer frequency distribution.
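As an illustration of the formula G = N / D, the following minimal Python sketch estimates genome size from a two-column k-mer histogram (depth, number of distinct k-mers at that depth). It is not the procedure used by Jellyfish or GenomeScope. The function names and the `min_depth` cutoff, which excludes low-depth k-mers that are likely sequencing errors, are assumptions introduced here; set `min_depth=1` to count every k-mer.

```python
def parse_histogram(lines):
    """Parse whitespace-separated 'depth count' pairs into a list of int tuples."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        pairs.append((int(fields[0]), int(fields[1])))
    return pairs


def estimate_genome_size(histogram, min_depth=5):
    """Return (genome_size, peak_depth) using G = N / D.

    histogram: iterable of (depth, count) pairs.
    min_depth: hypothetical cutoff; depths below it are treated as
               error k-mers and ignored (an assumption, not from the text).
    """
    usable = [(depth, count) for depth, count in histogram if depth >= min_depth]
    if not usable:
        raise ValueError("no histogram entries at or above min_depth")
    # D: depth at which the number of distinct k-mers is highest
    peak_depth = max(usable, key=lambda pair: pair[1])[0]
    # N: total number of k-mer occurrences (depth times distinct k-mers)
    total_kmers = sum(depth * count for depth, count in usable)
    return total_kmers / peak_depth, peak_depth
```

For example, a toy histogram with a single peak at depth 20 and 7,200 k-mer occurrences above the cutoff gives an estimate of 360 bases.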
We assembled the 10× Genomics short reads de novo into contigs and scaffolds using Supernova (v1.2) (Weisenfeld et al., 2017). Gaps in the initial assembly were filled with GapCloser (v1.12) (Luo et al., 2012) with the parameters “avg_ins=364, max_ins=500 and min_ins=260”. The draft assembly was then anchored and oriented into a chromosome-scale assembly using a Hi-C scaffolding approach. First, the raw Hi-C reads were filtered with HiC-Pro (v2.8.0) (Servant et al., 2015). Then 3D-DNA (v170123) (Dudchenko et al., 2017) with the parameters “-m haploid -s 0 -c 24” was used to anchor the primary contigs and scaffolds into chromosomes. The inter- and intra-chromosomal contact maps were built and visualized with Juicebox (Durand et al., 2016).
To further improve the integrity and accuracy of the genome assembly, we employed TGS-GapCloser, which uses low-depth (≥ 10×) single-molecule long reads without any error correction to close gaps in the draft assembly (Xu et al., 2019). The long reads were split into three groups: all reads (with options --min_idy 0.2 --min_match 200 --r_round 1), reads with length ≥ 20 kb (with options --min_idy 0 --min_match 0 --r_round 3) and reads with length of 2-20 kb (with options --min_idy 0 --min_match 0 --r_round 3). Each group was used to fill the corresponding aligned gaps.
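The grouping of long reads by length can be sketched as follows. This is an illustrative Python helper, not part of TGS-GapCloser: the function name, the group labels and the exact boundary handling (2 kb ≤ length < 20 kb for the middle group) are assumptions, since the text does not state whether the bounds are inclusive. The option values in `GROUP_OPTIONS` are copied from the description above and are only recorded here, not passed to any tool.

```python
# Option values per read group, as reported in the text.
GROUP_OPTIONS = {
    "all": {"min_idy": 0.2, "min_match": 200, "r_round": 1},
    "ge_20kb": {"min_idy": 0, "min_match": 0, "r_round": 3},
    "2_to_20kb": {"min_idy": 0, "min_match": 0, "r_round": 3},
}


def group_reads_by_length(reads, low=2_000, high=20_000):
    """Split (name, sequence) pairs into the three length groups.

    Every read goes into "all"; reads with length >= high also go into
    "ge_20kb"; reads with low <= length < high also go into "2_to_20kb".
    Reads shorter than `low` appear only in "all".
    """
    groups = {"all": [], "ge_20kb": [], "2_to_20kb": []}
    for name, sequence in reads:
        groups["all"].append(name)
        length = len(sequence)
        if length >= high:
            groups["ge_20kb"].append(name)
        elif length >= low:
            groups["2_to_20kb"].append(name)
    return groups
```

Returning read names rather than sequences keeps the sketch light; a real workflow would write each group to its own FASTA or FASTQ file before gap filling.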
The completeness of the genome assembly was assessed by Benchmarking Universal Single-Copy Orthologs (BUSCO) (Waterhouse et al., 2017) and GC content analyses. The single-copy orthologues of actinopterygii_odb9 (BUSCO, v2.0) were searched against the assembled genome using the BUSCO tool. The GC content and average sequencing depth across the genome were also measured in 10 kb non-overlapping windows, and windows containing more than 50% Ns were filtered out. No external contamination was found in the genome.
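The windowed GC calculation can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' script: the function name is hypothetical, GC content is computed over A, C, G and T bases only (so Ns do not dilute the value), and a final window shorter than the window size is still reported.

```python
def gc_windows(sequence, window=10_000, max_n_fraction=0.5):
    """Return (start, end, gc_fraction) for non-overlapping windows.

    Windows in which the fraction of N bases exceeds max_n_fraction are
    dropped. Coordinates are 0-based, with an exclusive end.
    """
    sequence = sequence.upper()
    results = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window]
        n_fraction = chunk.count("N") / len(chunk)
        if n_fraction > max_n_fraction:
            continue  # too many unknown bases to give a reliable value
        gc = chunk.count("G") + chunk.count("C")
        at = chunk.count("A") + chunk.count("T")
        called = gc + at
        if called == 0:
            continue  # no unambiguous bases in this window
        results.append((start, start + len(chunk), gc / called))
    return results
```

Plotting these per-window GC values against the average sequencing depth of the same windows is a common way to look for clusters that would indicate contamination.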