2.5.7 Paired-end library preparation and whole genome sequencing
Genomic DNA extraction was achieved by standard phenol:chloroform extraction as described previously.[32] The quality of the genomic DNA was checked on a 0.6% standard agarose gel stained with ethidium bromide using a NanoDrop ND-1000 spectrophotometer (ThermoFisher, Waltham, MA, USA). Quantification was performed using a Qubit 3.0 Fluorometer and a dsDNA BR assay kit (ThermoFisher, Waltham, MA, USA). Paired-end libraries for Illumina sequencing were prepared using a TruSeq Nano DNA Low Throughput Library Prep kit (Cat # 20015964, Illumina, San Diego, CA, USA) according to the supplier’s instructions. Briefly, library construction began with fragmentation of 200 ng of genomic DNA to a peak fragment size of 550 bp using a Covaris M220 instrument (Covaris, Woburn, MA, USA) with the following settings: 45 s at 20% duty cycle, intensity 50, temperature 20°C, and 200 cycles per burst. The DNA fragments were then purified using SPB beads included in the TruSeq kit, followed by end-repair, A-tailing, and ligation of Illumina adapters to the ends of the fragments. After SPB purification, the library was PCR-amplified using the following cycling conditions: initial denaturation at 95°C for 3 min, followed by 8 cycles at 98°C for 20 s, 60°C for 15 s, and 72°C for 30 s, and a final extension at 72°C for 5 min. After PCR clean-up, 1 µL of the library was used for validation in a 2100 Bioanalyzer (Agilent, Santa Clara, CA, USA). Library quantification was accomplished on a Qubit 3.0 Fluorometer using a dsDNA BR assay kit. The library was then sequenced at the VBCF NGS Core facility (Vienna, Austria) on an Illumina HiSeq 2500 sequencing instrument using v4 sequencing chemistry and a 2×125 nt paired-end sequencing protocol.
Raw genomic short reads from all wild-type and mutant strains were quality trimmed with Trimmomatic v0.35.[33](ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 AVGQUAL:20 MINLEN:50). We generated four reference sequences that represented our versions of the T7 and A1 wild-type strains: B3<T7-GFP> wt, B3<T7-Fab> wt, BQ<A1-GFP> wt, and BQ<A1-Fab> wt. Variants between the T7 wild-type strains and the publicly available BL21(DE3) genome (NCBI Reference Sequence: NC_012892.2) were extracted using their quality-trimmed reads. The same was repeated for A1 wild-type strains using the BL21 genome.[34] To extract the variants, we used breseq [35] (limit fold-coverage 150, minimum mapping quality 20, maximum read mismatches 15, no junction prediction, require match-fraction 0.5). The four updated references were obtained introducing the found variants with gdtools APPLY provided with breseq.
Using the quality-trimmed reads of the mutant strains, one genome sequence was assembled for each mutant. The peak insert size of each sequencing library was determined by mapping up to 10,000 read pairs with HISAT2 [36] on the appropriate reference genome sequence. Quality-trimmed reads were then down-sampled to a coverage of 150× using seqtk sample (https://github.com/lh3/seqtk). Down-sampled reads were assembled using SOAP-denovo[37] (-M 3 -L 100 -K 75 -k 25 -d 5 -D 5). The peak insert size determined for each library was used to calibrate the genome assembler. The specified config file parameters were: reverse_seq=0, asm_flags=3, max_rd_len=125, rd_len_cutoff=125, rank=1, map_len=30. Assembly metrics (contig N50 length and contig N90 length) were assessed using Biopython.[38]
Single nucleotide polymorphisms (SNPs) and short insertion/deletion (indel) variants up to 20 bp in length between mutant and wild-type strains were extracted. Short reads from each mutant were mapped onto their corresponding wild-type reference using breseq.[35] The program was run in polymorphism mode (polymorphism prediction, polymorphism frequency cutoff 0.05) using the same parameters as for the generation of mutated references. True variants were then selected by mapping the contigs of the assembled genomes of the mutants; only variants with both read and contig mapping were retained. The clonality of each variant was inferred from the ’AF’ field in the VCF files.
The presence of candidate genome rearrangements was screened using the assembled genome sequences for the mutants. Contigs were mapped onto the corresponding wild-type reference sequence with nucmer[39] (mum, breaklen 10, mincluster 500, delta, diagfactor 0.05, maxgap 30000, minmatch 50). The produced mapping records were then passed to the show-diff tool, available with nucmer, which highlights potential rearrangements and their type.