2.2.1 Read processing and generation of amplicon sequence
variants
vAMPirus supports single- and paired-end raw Illumina read libraries as
input. By default, read processing and amplicon sequence variant (ASV)
generation are performed before the DataCheck or Analyze pipelines are
run (Figure
2, yellow box; Supplemental Figure S1). The read processing pipeline
begins with a quality check of the raw libraries using FastQC (v0.11.9,
Andrews, 2010), which creates and stores reports for review by the user.
While FastQC is running, fastp (v0.20.1, Chen et al., 2018)
automatically detects and removes adapter contamination, and performs
quality/length filtering based on user-set parameters in the
configuration file. fastp also performs over-representation analysis and
(for paired-end input) base error correction during this step. Next,
primers are removed from the adapter-trimmed reads using the bbduk.sh
program within the BBTools software package (Bushnell, 2014), and a
second FastQC report is generated and stored. Cleaned reads are then
merged using VSEARCH (v2.21.1, Rognes et al., 2016), and the merged
reads from all libraries are concatenated into a single FASTQ file.
Accurate ASV generation requires that all merged reads be the same
length (Edgar, 2016b). To ensure this, merged reads
are globally trimmed to a user-specified maximum read length using
fastp. Merged reads of the set length are then extracted from the
pooled merged-read file using bbduk.sh and dereplicated using VSEARCH
(v2.21.1, Rognes et al., 2016), producing a file of unique reads that
records how many reads each unique sequence represents. Amplicon sequence
variants are then generated from this unique read file with VSEARCH and
the UNOISE3 algorithm (Edgar, 2016b; Rognes et al., 2016). Chimeric ASVs
are detected and removed using VSEARCH and the UCHIME3 algorithm (Edgar,
2016a). Before the ASVs enter the downstream pipelines, vAMPirus gives
users the option to filter them with DIAMOND blastx (v2.0.15, Buchfink
et al., 2015), either to remove non-target sequences or to focus
analyses on a subset of ASVs/aminotypes. These steps produce a final ASV
FASTA file that is then used as input for the DataCheck and Analyze
pipelines.
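
The length standardization and dereplication steps described above can be illustrated with a minimal Python sketch. This is not the code vAMPirus runs (the pipeline uses fastp, bbduk.sh, and VSEARCH for these steps); the function names, the toy reads, and the `uniq` header prefix are hypothetical and chosen only for illustration. The `;size=N` header annotation follows the abundance convention used by VSEARCH.

```python
from collections import Counter


def trim_to_length(reads, length):
    """Truncate every read to at most `length` bases (global trimming)."""
    return [read[:length] for read in reads]


def keep_exact_length(reads, length):
    """Keep only reads that are exactly `length` bases long."""
    return [read for read in reads if len(read) == length]


def dereplicate(reads):
    """Collapse identical reads into (sequence, count) pairs.

    Pairs are sorted by decreasing count, with ties broken alphabetically
    so the output is deterministic.
    """
    counts = Counter(reads)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def to_fasta(uniques, prefix="uniq"):
    """Format unique reads as FASTA with ';size=N' abundance annotations."""
    lines = []
    for index, (sequence, count) in enumerate(uniques, start=1):
        lines.append(f">{prefix}{index};size={count}")
        lines.append(sequence)
    return "\n".join(lines)


# Toy merged reads pooled from all libraries (hypothetical data).
merged = ["ACGTACGTAA", "ACGTACGTAC", "ACGTACGT", "ACGTAC", "TTGTACGTGG"]

set_length = 8  # stands in for the user-specified read length
trimmed = trim_to_length(merged, set_length)
fixed = keep_exact_length(trimmed, set_length)  # the 6-base read is dropped
uniques = dereplicate(fixed)

print(to_fasta(uniques))
```

In this toy example, three of the five reads collapse to one unique sequence after trimming, one read is discarded for being shorter than the set length, and the remaining read forms a second unique sequence. Denoising algorithms such as UNOISE3 operate on abundance-annotated unique sequences of this kind.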