3 Results and Discussion
Genome assembly
We adopted second-generation and third-generation sequencing for the comprehensive analysis of theR. pedestris genome. A total of 118.9 Gb clean data was generated using the Illumina platform. The k-mer (k=19) analysis indicated that theR. pedestris genome had high heterozygosity and the estimated size was 1.04 Gb (Figure S1). Using the PacBio RS II platform, a total of 132 Gb clean data was generated with an N50 read length of 18.1 kb (Table S2). Based on the sequencing data above, we obtained a contig-level assembly that contained 934 contigs and 1,211,875,501 bases. The contig N50 reached 13,968,850 bp, which is significantly longer than the genome assemblies of other bugs (Table 1; Table S3).
Then, we used Hi-C long‐range scaffolding to improve our assembly. A total of 105 Gb raw data was generated, and 82-Gb filtered clean reads were used for the following analysis (Table S4). As a result, more than 95.7% of the total genome bases were successfully anchored to 6 unique chromosomes, with the chromosomic lengths ranging from 162.5 Mb to 285.2 Mb (Figure 2A; Figure 3). Chromosomic staining was performed to validate the Hi-C result, and we identified 12 chromosomes (n=5A+XX) in female R. pedestris and 11 chromosomes (n=5A+X) in male R. pedestris (Figure 2B). A previous work demonstrated that the sex chromosome system of most insects in the family Coreidae is XO (Muramoto 1978). The concordant results between Hi-C assembly and chromosomic staining confirmed the accuracy of our genome assembly and indicated thatR. pedestris is XO-determinant.
There were 745 scaffolds that failed to anchor. These unlocalized scaffolds were further blasted against the bacteria database to identify the potential bacteria that resided within R. pedestris . A total of 168 scaffolds showed high homology to known bacteria, and these bacterial-origin scaffolds were used for further analysis (Table S5). Taking these results together, we assembled the R. pedestrisgenome (containing 6 chromosomes and 577 unlocalized scaffolds) with a final size of 1.193 Gb (GC content: 33.8%) and a scaffold N50 of 181.3 Mb (Table 1).
To assess the completeness of the genome assembly, the assembled genome was subjected to BUSCO analyses against insect, arthropod, metazoan, and eukaryote datasets. The results showed that 94.7% (against the insect dataset) to 96.5% (against the arthropod dataset) of BUSCO genes could be identified in the R. pedestrisgenome, suggesting the completeness and high quality of our genome assembly (Figure S2). There were 2,774,084 repetitive sequences identified, accounting for 56.0% of the R. pedestris genome (Table S6). Additionally, we identified 1,266,644 SSRs in the genome assembly (Table S7).
Identification of the Serratia marcescens genome
R. pedestris is well known for its intimate association withBurkholderia , which provides beneficial effects on insect development (Takeshita and Kikuchi 2017). Our genome assembly did not detect any Burkholderia -origin sequence, indicating the absence of Burkholderia in this insect strain. There were 168 bacterial-origin scaffolds from Serratia , Enterobacter ,Kluyvera , and Wolbachia in our genome assembly (Table S5).Serratia was reported to confer host resistance to invading parasitoids (Bai et al. 2019), while Wolbachia supplies essential nutrition for host reproduction (Ju et al. 2020). Although the insect used for genome sequencing was free of Burkholderia , we did not observe any abnormal phenotype previously described in aposymbiotic insects (Kikuchi et al. 2007). We presumed that additional bacteria, other than Burkholderia , provide specific nutritional or other contributions to R. pedestris , which is an aspect that deserves further investigation.
Among the bacterial-origin scaffolds, the longest scaffold (sc00000) could be successfully assembled into a circle, and we mainly focused on this complete bacteria genome in the present study. Phylogenetic analysis showed that the sc00000 was closely related to Serratia marcescens Db11 (Figure 4). S. marcescens is a gram-negative bacillus that belongs to Enterobacteriaceae . It has a wide host range, including plants, insects and animals (Grimont and Grimont 1978). In R. pedestris , S. marcescens was reported to colonize the insect midgut, but its function remained unknown (Lee et al. 2017). Our study, for the first time, assembled the complete genome of S. marcescens from R. pedestris , and we tentatively named this bacterial strain as S. marcescens Rip1 (hereafter: Rip1). The complete genome of Rip1 comprises a single circular chromosome of 5.04 Mbp, with an overall G+C content of 59.4% (Figure 4). SomeSerratia strains, such as S. marcescens CAV1761 andS. marcescens AS-1, contain additional plasmids (Sakuraoka et al. 2019). However, no sequence homologous to these plasmids was found in Rip01. Overall, the genomes of Rip1 and Db11 are highly conserved and essentially collinear (Figure S3).
Previous work demonstrated that S. marcescenssecretes virulence factors to counter antimicrobial proteins produced by R. pedestris , thus escaping host immune defenses (Chen et al. 2017; Lee et al. 2017). We identified at least 15 antibiotic-resistance genes and 32 putative virulence factors in Rip1 (Figure 4, Table S8, Table S9). These genes confer antibiotic resistance to a wide variety of antibiotics, including fosfomycin, mupirocin, polymyxin, aminocoumar, beta-lactam, isoniazid, and fluoroquinolone, and might be key in the successful colonization ofS. marcescens in the host midgut.
Genome annotation
Oxford Nanopore technology (ONT) was used to aid the genome annotation. A total of 22 Gb clean data in 25,876,282 ONT subreads was obtained, with an N50 read length of 1,027 bp (Table S10). In total, we identified 21,562 non-redundant protein-coding genes using the LoReAn annotation pipeline. The number of predicted genes in R. pedestris was similar to those in G. buenoi and O. fasciatus but much higher than that in C. lectularius . Quality control showed that the predicted gene set was able to recover 92.0% (against the insect dataset) to 96.7% (against the eukaryote dataset) of BUSCO genes (Figure S4). The majority (21,139 genes, 98.0%) of the predicted genes were distributed within the 6 chromosomes. Among the 21,562 predicted genes, 17,909 genes were successfully annotated by at least one database in the NCBI Nr, Swiss-Prot, KOG, KEGG, GO, and InterPro domain databases (Figure S5-S8; Table S11). Additionally, we identified 10 ncRNAs, 51 rRNA, 4,103 tRNA, 221 small nuclear RNA, 80 microRNA, and 3,202 long ncRNAs in the R. pedestris genome (Table S12).
Expression patterns of predicted genes
RNA-seq data from 6 different tissues and 37 different development stages were analyzed (Table S13), and a total of 21,320 genes were determined to be transcribed (Supplementary files 1). For tissue-specific analysis, there were 498, 21, 401, 254, 2389, and 3181 genes found to be mainly expressed in the midgut, fat body, carcass, muscle, testis, and ovary, respectively (Figure S9A). Enrichment analysis demonstrated that most midgut-specific genes were involved in metabolism, exosomes, and proteasomes (Figure S10A), while most carcass-specific genes were involved in structural constituents of the cuticle, environmental adaption, amino acid metabolism, odorant binding and sensory perception (Figure S10B). The muscle-specific genes were mainly associated with the cytoskeleton, ion channels, lipid metabolism, and protein kinases (Figure S10C). There were abundant genes specifically expressed in gonads. For testis-specific analysis, the enriched pathways were mainly annotated as being associated with signaling and cellular processes, transporters, carotenoid biosynthesis, chromosome proteins, and cytochrome P450s (Figure S10D). For ovary-specific analysis, enriched pathways were associated with chromosome proteins, DNA metabolism, glycosyltransferase, and glycan metabolism (Figure S10E). Additionally, we used LC-MS/MS to identify proteins authentically expressed in these 6 tissues (Figure S9B). A total of 2,622 proteins was successfully identified, with a unipep≥2 (Supplementary files 2).
For the developmental analysis, a weighted gene co-expression network analysis (WGCNA) successfully assigned 9,887 genes into 32 modules (Figure S11). Among them, 6,684 genes (belonging to three modules) were more highly expressed in the egg period than during other stages. There were 1,467 and 227 genes that showed the highest expression in the male and female adult period, respectively. Additionally, we identified 352 genes (belonging to three modules) associated with insect molting. For example, insulin-like growth factor-binding protein 7, replication factor C, and endocuticle structural glycoprotein were more highly expressed before insect molting, while N-acetylneuraminate lyase, partitioning defective 3, and cytochrome P450 6a14 were more highly expressed during insect molting (Supplementary files 1). These data provide a good resource for future functional studies of R. pedestris.
Identification of the sex chromosomes and autosomes
To identify the potential sex chromosome, Illumina reads from male and female R. pedestris were mapped to the reference genome, respectively. In the XO system, males and females have only an X chromosome, and the MRPM of the X chromosome in females should be nearly two-fold higher than that in males (Ma et al. 2020). Based on this presumption, the shortest chromosome was determined to be the X chromosome, with nearly double the MRPM in females compared to that in males (1,414,872 versus 859,724). For the other 5 chromosomes, the female/male ratio of the MRPM ranged from 0.87 to 0.97, which is suggestive of autosomes (Figure 3, Table S14).
Gene family identification and phylogenetic tree construction
To investigate the phylogenetic position of R. pedestris , protein data from 12 hemipteran insects were retrieved from public database, and D. melanogaster was used to root the tree. A total of 23,850 gene families were clustered by OrthoMCL, and we selected 963 single copy genes with 356,827 reliable sites to construct the phylogenetic tree. The results showed thatR. pedestris is a sister taxon to the milkweed bug O. fasciatus , and the 2 species diverged from their common ancestor at approximately 35 million years ago (Figure 5). Three hematophagous bugs (R. prolixus , C. lectularius , and T. rubrofasciata ) form a sister lineage, and they clustered with three herbivorous bugs in Heteroptera. There were 27, 216, and 45 gene families identified as Sternorrhyncha-specific, Auchenorrhyncha-specific, and Heteroptera-specific, respectively. Gene-family expansion/contraction analysis demonstrated a higher frequency of gene contraction in Heteroptera. Compared with O. fasciatus , 817 gene families in R. pedestris underwent expansion, while 1130 gene families in R. pedestris underwent contraction (Figure 5).
Chromosome synteny
T. rubrofasciata (680 Mb in genome size), which is the only chromosomal-level bug genome available in the Heteroptera suborder, was reported to have 13 chromosomes, while R. pedestris has only 6 chromosomes. Synteny analysis demonstrated some fusion or fission events between the two bug species (Figure 6). For example, R. pedestrisChr1 contains segments from T. rubrofasciata Chr3, Chr5, Chr6, Chr7, and Chr11, and R. pedestris Chr2 contains segments fromT. rubrofasciata Chr1 and Chr4. The sex chromosome of R. pedestris showed high synteny with T. rubrofasciata Chr12 and Chr13. It is likely that Chr12 or Chr13 might be the sex chromosome ofT. rubrofasciata , which deserves further investigation.