Methods
Dovetail Chicago library preparation and
sequencing:
A Chicago library was prepared as described previously [42].
Briefly, ~500ng of high molecular weight (HMW) gDNA
(mean fragment length = 75 kbp) was reconstituted into chromatin in
vitro and fixed with formaldehyde. Fixed chromatin was digested withDpnII , the 5’ overhangs filled in with biotinylated nucleotides,
and then free blunt ends were ligated. After ligation, crosslinks were
reversed, and the DNA purified from protein. Purified DNA was treated to
remove biotin that was not internal to ligated fragments. The DNA was
then sheared to ~350 bp mean fragment size, and
sequencing libraries were generated using NEBNext Ultra enzymes and
Illumina-compatible adapters. Biotin-containing fragments were isolated
using streptavidin beads before PCR enrichment of each library. The
libraries were sequenced on an Illumina HiSeq X to produce 500 million
151 base paired-end reads.
Dovetail Hi-C library preparation and
sequencing:
A Dovetail Hi-C library was prepared in a similar manner as described
previously [43]. Chromatin was fixed in the nucleus with
formaldehyde, extracted, and DpnII digested. 5’ overhangs were
filled with biotinylated nucleotides, and then free blunt ends were
ligated. Crosslinks were reversed for DNA purification from proteins.
Purified DNA was treated to remove biotin outside of ligated fragments.
The DNA was then sheared to ~350 bp, and libraries were
generated using NEBNext Ultra enzymes and Illumina-compatible adapters.
Biotinylated fragments were isolated with streptavidin beads before PCR
enrichment of libraries and sequenced on an Illumina HiSeq X to produce
531 million 2x151 bp paired end reads.
Genome Assembly
The H. glycines genome was assembled with Falcon using previously
deposited Pacbio sequencing (SRX2692203 - SRX2692222). Falcon unzip
0.4.0 [44] was then used to reduce the heterozygosity in the
assembly. Dovetail Genomics scaffolded this assembly with Chicago and
Hi-C reads using a modified SNAP read mapper
(http://snap.cs.berkeley.edu) and an iterative assembly with
HiRise. This Dovetail assembly was further scaffolded with the
previously mentioned Pacbio subreads using Sspace-longread v1.1
[45]. Gmcloser 1.6.2 [46] was used to fill gaps using PacBio
circular consensus (CCS) reads. Pilon 1.22 [47] was then used to
polish the assembly using ~142 million 260bp PE Illumina
reads, which were processed with Trim Galore 0.4.5 [48], Hisat2
2.1.0 [49], and Samtools 1.9 [50]. The genome was then
scaffolded using the previously mentioned Hi-C reads using Juicer 1.7.6
[43], 3D-DNA 180419 [51], and manually corrected using Juicebox
1.9.8 [52]. Once pseudomolecules were assembled, the genome was
polished with Pilon 1.23 [47] using CCS reads, then with the 142
million Illumina PE reads, and then with 38 iterations of polishing
using Pacbio CCS reads with Pilon 1.23. A final round of polishing was
performed with the 142 million Illumina 250bp reads (SRR8381095) using
Pilon 1.23 [47].
Gene Prediction
Genes were predicted using a Mikado 1.24 pipeline [53] that picked
consensus transcripts from seven transcriptome assemblies and gene
predictions. First, the genome was masked using RepeatModeler 1.0.11
[54] and RepeatMasker 4.0.9 [55]. Previously published Illumina
RNA-seq reads (SRX3339090-
SRX3339098) were processed with Trim Galore 0.4.5 [48], Hisat2 2.1.0
[49], and Samtools 1.9 [50] on both a masked and an unmasked
genome. Previously published NCBI expressed sequence tags (downloaded
06-17-19) and IsoSeq (SRX3702373) were aligned using Gmap (version
2018-03-25) [56]. These data were utilized with Braker 2.1.0
[57] using all three data sources, annotating both an unmasked
assembly and a masked assembly to compensate for parasitism-related CNV
genes. Transcriptomes were assembled using the guidance of a masked
genome with Trinity 2.3-.2 [58, 59], Class2 2.1.7 [60],
Stringtie 1.3.4a [61, 62], and Spades 3.13.1 [63]. This first
Mikado prediction was utilized in a second round of Mikado, supplemented
with masked braker prediction and a Maker 2.31.10 [64] gene
annotation from a 368-scaffold version of the assembly. All resulting
predictions from the second round of Mikado were collapsed into gene
loci via using shared intron/exon borders with Cufflinks gffread
(Cufflinks 2.2.1) [65].
The Maker annotation mentioned previously was run over four rounds, with
Maker’s internal algorithm first, then Augustus 3.2.1, then Snaphmm
2006-07-28, followed by GeneMark-ES 4.32. Repeatmodeler 1.0.11 and
RepeatMasker 4.0.9 were used to perform the softmasking used in the
annotation. Maker utilized all transcripts and proteins from related
species genomes [66, 67] and UniProt [68], including:Bursaphelenchus xylophilus [69] , Ditylenchus
destructor , Globodera pallida [70] , Globodera
rostochiensis [71] , Globodera ellingtonae [72] ,Meloidogyne floridensis [73] , Meloidogyne hapla
[74], Meloidogyne incognita [75] ,Parastrongyloides trichosuri , Rhabditophanes_KR3021 ,Strongyloides papillosus , Strongyloides ratti ,Strongyloides stercoralis ; all H. glycines ESTs from NCBI
[67], and a Braker 1.9 [76] annotation on this unmasked assembly
using published RNA-seq (SRX3339090- SRX3339098).
Functional Gene
Annotations
Gene annotations were compiled
from Interproscan 5.27-66.0 [77] and BLAST [78] searches to NCBI
NT and nr databases downloaded on 10-23-19 [67], as well as
swissprot/uniprot databases downloaded on 12-09-2019 [68]. Genes
encoding transposable element-associated proteins were identified using
Bedtools 2.27.1 [79] with exon overlaps to Repeatmodeler-predicted
transposable elements.
Differential Gene
Expression
The strandedness of the RNA-seq was evaluated with RseQC V4.0 [80,
81], followed by alignment to the genome with HiSat (2.2.0) [49],
and converted to bam with Samtools (1.1.0) [50]. Read counts were
calculated with FeatureCounts from Subread package (1.6.0) [82],
followed by Deseq2 (1.20.0) [83] with P-value cutoffs at 0.05 to
determine differential expression between the samples.
BUSCO analysis
Universal single copy orthologous genes were assessed using BUSCO 3.0.2
[84-86] on both the predicted proteins and the genome against the
nematoda ODB9 dataset. Missing genes were verified with BLAST [78]
to the predicted protein sequences using a 0.01 e-value and 1.6x -0.4x
length cutoff (S table 1).
Effector gene mapping
Effector proteins were mapped to the predicted proteome using Diamond
0.9.23 [87]. Effector genes were mapped to the genome using Gmap
(2018-03-25) [56]. Secreted proteins were identified with SignalP
5.0 [88] on the predicted proteome.
Repeat Prediction
Multiple repeat predictions were pursued to finely detail genome
structure. To comprehensively predict the structure of transposable
elements in the genome with Extensive de-novo TE Annotator, EDTA 1.7
[89]. Tandem repeat finder 4.0.9 [90] was run on the genome to
identify tandem repeats. A repeat prediction sensitive to copy number
variation was also pursued with RepeatModeler 1.0.11 [54] and
RepeatMasker 4.0.9 [55].
Synteny
Genome alignments were performed using Mummer3 [91] and merged for
display in Circos 0.69-6 [92] using Bedtools merge [79]. By
inferring gene orthology from primary mapping sites of the predicted
transcripts from our genome with Gmap 2018-03-25 [56], we inferred
gene-based synteny with iAdHoRe 3.0.01 [93].