2.3 RNA preparation and sequencing
Total RNA was extracted from the tissues of pooled mites (adult mites,
nymphs, larvae and eggs) using TRIzol reagent (Invitrogen, Carlsbad, CA,
USA) according to the manufacturer’s instructions. RNA degradation and
contamination were monitored by 1 % agarose gel electrophoresis. RNA
purity was assessed using Nanodrop®2000 spectrophotometer (ThermoFisher
Scientific, UK) and then quantified. RNA integrity was assessed using an
RNA Nano 6000 Assay Kit with the Bioanalyzer 2100 system (Agilent
Technologies, Santa Clara, CA, USA). RNA sequencing libraries were
constructed using the NEB Next mRNA Library Prep Kit following the
manufacturer’s instructions and sequenced on the Illumina HiSeq X
platform. Finally, a total of 25.03 Gb clean data were generated.
Genome assembly and evaluation of S. scabiei var.cuniculi genome
Jellyfish (v2.0) was used to estimate genome size based on k -mer
(kmer=17) distribution using short-insert size libraries
(Kingsford, 2011). The estimated genome
size of S. scabiei var. cuniculi is 49.83 Mb and the
heterozygous ratio is 1.02%.
Using long reads generated by the PacBio Sequel platform, contigs were
assembled using the canu (canu-1.7) software
(Koren et al. , 2017) with
parameters setting as ‘-correct genomeSize=50m gnuplotTested=true
stopOnReadQuality=false -pacbio-raw pacbio.fasta -trim genomeSize=50m
gnuplotTested=true stopOnReadQuality=false -pacbio-corrected
PacBio.correctedReads.fasta.gz -assemble genomeSize=50m
gnuplotTested=true stopOnReadQuality=false -pacbio-corrected
PacBio.trimmedReads.fasta.gz’. The initial assembly was then polished
using Quiver (smrtlink 6.0.1) with default parameters
(Chin et al. , 2013).
Heterozygosity in the assembly was removed by Purge Haplotigs software
(v1.0.4) (Roach et al. , 2018).
Short Illumina reads were then used to correct any remaining errors by
pilon (v1.22) with parameters set as follows: ‘-Xmx300G –diploid
–threads 20’ (Walker et al. ,
2014). Finally, we used Hi-C data to scaffold S. scabiei var.cuniculi genome to chromosome-level by Lachesis software
(version-201701) with default parameters
(Burton et al. , 2013).
To evaluate the accuracy of the assembly at single base level, short
Illumina reads were mapped to the S. scabiei var. cuniculigenome using BWA (H. Li & Durbin, 2009)
with parameters setting as ‘-k 32 -w 10 -B 3 -O 11 -E 4’ and performed
variant calling with SAMtools (v1.8) (H.
Li, 2011). Meanwhile, assembly completeness was assessed based on
Benchmarking Universal Single-Copy Orthologs (BUSCO v4.0,
arthropoda_odb9) (Simao et al. ,
2015) and Core Eukaryotic Genes Mapping Approach (CEGMA V2.5)
(Parra et al. , 2007).
Genome annotation
Homologous comparison and de novo prediction methods were used to
annotate the repeat sequences on S. scabiei var. cuniculigenome. RepeatMasker and the RepeatProteinMask v4.0.8
(https://www.girinst.org/repbase/) were performed for homologous
comparison against Repbase database
(https://www.girinst.org/repbase/)
(Jurka et al. , 2005). For ab
initio prediction, LTR_FINDER
(http://tlife.fudan.edu.cn/ltr_finder/)(Xu
& Wang, 2007), RepeatScout
(http://www.repeatmasker.org/)(Priceet al. , 2005) and RepeatModeler (v2.1)
(http://www.repeatmasker.org/RepeatModeler.html) were first used for
repetitive elements de novo candidate database constructing, and
further annotated using RepeatMasker. Tandem repeat was predicted using
TRF (http://tandem.bu.edu/trf/trf.html).
Gene prediction was performed through combination methods of
homology-based prediction, de novo prediction and
transcriptome-based prediction. For homologous annotation, protein
sequences including S. scabiei var. canis ,Metaseiulus occidentalis , Ixodes scapularis ,Tetranychus urticae , Drosophila melanogaster ,Tropilaelaps mercedesae , Pediculus humanus ,Tribolium castaneum and Stegodyphus mimosarum were aligned
against the S. scabiei var. cuniculi genome using TBLASTN
(Altschul et al. , 1990). Blast hits
that correspond to reference proteins were concatenated by Solar
software (version 0.9.6) and low-quality records were filtered. Sequence
of each reference protein was extended upstream and downstream by 1000
bp to represent a protein-coding region. GeneWise
(Birney et al. , 2004) software was
used to predict gene structure contained in each protein region.
Homology predictions were denoted as “Homology-set”. All RNA-seq clean
data were first de novo assembled using Trinity
(v2.0)(Grabherr et al. , 2011) and
the assembled sequences were then aligned against the S. scabieivar. cuniculi genome using PASA pipeline
v2.0.2(Haas et al. , 2003) with
BLAT (Kent, 2002) as the aligner. Gene
models created by PASA were denoted as PASA-T-set (PASA Trinity set). We
simultaneously employed five tools of Augustus
(Stanke & Morgenstern, 2005), GeneID
(Guigo et al. , 1992), GeneScan
(Burge & Karlin, 1997), GlimmerHMM
(Majoros et al. , 2004), and
SNAP(Korf, 2004) for ab initioprediction, in which Augustus, SNAP, and GlimmerHMM were trained by
PASA-T-set gene models. In addition, RNA-seq reads were directly mapped
to the genome using Tophat (v2.0.9)
(Trapnell et al. , 2009), and then
the mapped reads were assembled into gene models (Cufflinks-set) by
Cufflinks (Trapnell et al. ,
2010). According to these three approaches, all the gene models were
finally integrated by EvidenceModeler(EVM
v1.1.1)(Haas et al. , 2008).
Weights for each type of evidence were set as follows: PASA-T-set
> Homology-set > Cufflinks-set >
Augustus > GeneID = SNAP = GlimmerHMM = GeneScan. In order
to get the untranslated regions (UTRs) and alternative splicing
information, PASA2 was used to update the gene structure. To achieve the
functional annotation, the predicted protein sequences were aligned
against public databases, including SwissProt
(Bairoch & Apweiler, 2000), NR database
(from NCBI), InterPro and KEGG pathway
(Kanehisa & Goto, 2000) (release 76). Of
that, InterproScan tool (Jones et
al. , 2014) in coordination with InterPro database
(Finn et al. , 2017) were applied
to predict protein function based on the conserved protein domains and
functional sites. NR, KEGG pathway and SwissProt database were mainly
used to map gene set to identify the best match for each gene.
Reconstruction of phylogenetic relationship and estimation of
divergence time
We retrieved the protein-coding sequence of 7 arthropod includingT. urticae , Apis mellifera , D. melanogaster ,I. scapularis , S. mimosarum , M. occidentalis ,T. mercedesae and a nematode Caenorhabditis elegans from
Ensembl (Release 85). For gene model with multiple alternative isoforms,
only the longest transcript was selected to represent this gene. To
identify orthologous genes, OrthoMCL pipeline with the parameter of
“-inflation 1.5” (L. Li et al. ,
2003) as used to construct gene families. In total, 15,542 gene
families were identified, involved in 1,338 single-copy orthologous gene
families.
The phylogenetic relationship of S. scabiei var. cuniculiwas reconstructed using the 1,338 shared single-copy orthologous genes.
The sequences were aligned by MUSCLE tool with default parameters
(Edgar, 2004). Sequences were then
concatenated to one super-gene sequence for each species and formed a
data matrix. The GTR-GAMMA was selected as the best substitution model
deduced by jModeltest tool (Darribaet al. , 2012). Then the phylogenetic analysis was performed
using maximum-likelihood (ML) algorithm in RAxML tool
(Stamatakis, 2006). The best-scoring ML
tree was inferred by rapid BP algorithm and ML searches after performing
1,000 rapid bootstraps. The MEGA (version 7) was used for visualizing
the constructed phylogenetic tree (Kumaret al. , 2016).
Furthermore, divergence time between S. scabiei var.cuniculi and other species were estimated based on the phylogeny.
Four calibration times obtained from TimeTree database were used to
calibrate divergence dates of other nodes on this phylogenetic tree
(Sudhir et al. , 2017). In this
process, we implemented the Monte Carlo Markov Chain algorithm for
divergence time estimation by MCMCtree tool in PAML package
(Yang, 2007). Finally, we found thatS. scabiei var. cuniculi was separated away from its
closest species of T. urticae at approximate 338.8 million years
ago.
Homolog annotation method was applied for gene models prediction of
scabies mites variants (isolated from pigs and
humans)(Mofiz et al. , 2016),Psoroptes ovis (Burgess et
al. , 2018) and Dermatophagoides pteronyssinus(Waldron et al. , 2017) that only
have genome assembly files available in public database. OrthoMCL
pipeline was employed to acquire shared single copy genes, molecular
phylogenetic relationship of scabies mites from four hosts with
plant-living, free-living, predatory mites, and other parasitic mite
species that have available genomic data sets was constructed based on
shared single copy genes generated from the same methods mentioned
above.
Gene-family evolution and identification of gene families
related with detoxification
The evolutionary dynamics (expansion/contraction) of gene families were
analyzed using CAFÉ tool(De Bie et
al. , 2006 ) with a stochastic birth and death model. Global parameter
λ was estimated based on the phylogenetic tree and the datasets of gene
family clustering. Viterbi method in CAFÉ was employed to identify the
significantly changed families (p -value < 0.05).
Finally, significantly changed expansion/contraction gene families were
used to perform enrichment analysis.
To further analyze gene families of detoxification enzymes and
transporters, the S. scabiei var. cuniculi genome sequence
assembly was searched by TBLASTN (E -value=10-5)
(Altschul et al. , 1990) with close
arthropod relatives including T. urticae , T. mercedesae ,M. occidentalis , and D. melanogaster target gene family
protein sequences. The Basic Local Alignment Search Tool (BLAST) hits
were then conjoined by Solar software (Yuet al. , 2006). GeneWise was used to predict the exact gene
structure of the corresponding genomic region on each BLAST hit
(Birney et al. , 2004). We then
annotated preliminary identified candidate genes with Pfam database
(http://pfam.xfam.org/) and NR
database to get putative gene sets. Finally, a The proteins of target
genes were aligned with mafft software
(Katoh & Standley, 2014) and a
neighbor-joining tree was constructed with TreeBest
(Vilella et al. , 2009) with
aligned by MAFFT software (Katoh &
Standley, 2014).
Characterization of Hox genes
We identified putative Homeobox genes in S. scabiei var.cuniculi genome sequence assembly by performing TBLASTN
(E -value=10-5)(Altschulet al. , 1990), using curated Homeobox proteins from A.
mellifera, Bombyx mori, Anopheles gambiae, Daphnia pulex, D.
melanogaster, I. scapularis, Pachycrepoideus vindemmiae, Strigamia
maritima, T. urticae, and Tribolium castaneum . The Basic Local
Alignment Search Tool (BLAST) hits were then conjoined by Solar software
(Yu et al. , 2006). GeneWise was
used to predict the exact gene structure of the corresponding genomic
region on each BLAST hit (Birney et
al. , 2004). HMM searches of the above genes against the Pfam database
(http://pfam.xfam.org/) revealed each to have PF00046.29 domain. The
classification of deduced proteins and their integrity were verified
using BlASTP with model species D. melanogaster Hox genes.E -values < 10-5 was considered as
significant hits of similarity.
Phylogenetic and population genetic analysis
Twelve scabies mite colonies from three different mammal hosts (one from
dog, three from pigs and eight from rabbits) were sequenced on the
Illumina HiSeq X sequencing platform using standard procedures. In
addition, we also downloaded 7 individuals in the public database from
other representative populations, including one mite population isolated
from dogs, 2 isolated from humans and 4 isolated from pigs
(Mofiz et al. , 2016;
Rider et al. , 2015). Detailed
information for all 20 samples was shown in Tables S1 .
Consequently, a total of 136.45 Gb high-quality paired-end DNA sequence
was obtained. For downloaded datasets, high sequencing depth data were
truncated to around 50x coverage for further analysis.
FASTQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) was
used for reads quality control. Clean reads were then aligned to theS. scabiei var. cuniculi genome using Burrows-Wheeler
Alignment MEM (BWA-MEM) version 0.7.8 with default parameters except for
the “-t 4 -k 32 -M -R” option (H. Li &
Durbin, 2009). The ‘sort’ and ‘rmdup’ commands of SAMtools (v1.3) were
used to perform data manipulation and alignment statistics. SNP calling
was performed by SAMtools (v1.3) ‘mpileup’ command
(H. Li et al. , 2009), bcftools
‘call’ command. To estimate individual admixture assuming different
numbers of clusters, the genetic structure was inferred using sNMF
(http://membres-timc.imag.fr/Olivier.Francois/snmf/software.htm) ,
which is based on sparse non-negative matrix factorization algorithms.
We calculated the numbers of genetic clusters from 2 to 7 to explore the
convergence of individuals with default settings and plotted Kfrom 2-4.
The software GCTA
(http://cnsgenomics.com/software/gcta/)
was used for principal component analysis (PCA) with biallelic SNPs of
the 20 individuals. Only the first two significant components were
plotted. The pam function of Rggfortify package
(https://cran.r-project.org/web/packages/ggfortify/index.html)
was used to determine the significant level of the principal components.