2.2.1 Read processing and generation of amplicon sequence variants
vAMPirus supports single- and paired-end raw Illumina read libraries as input. By default, read processing and amplicon sequence variant (ASV) generation are performed prior to entering the DataCheck or Analyze pipelines (Figure 2, yellow box; Supplemental Figure S1). The read processing pipeline begins with a check of raw libraries using FastQC (v0.11.9, Andrews, 2010), which creates and stores reports for review by the user. In parallel with FastQC, fastp (v0.20.1, Chen et al., 2018) automatically detects and removes adapter contamination and performs quality and length filtering based on user-set parameters in the configuration file. fastp also performs over-representation analysis and (for paired-end input) base error correction during this step. Next, primers are removed from the adapter-trimmed reads using the bbduk.sh program within the BBTools software package (Bushnell, 2014), and a second FastQC report is generated and stored. Cleaned reads are then merged using VSEARCH (v2.21.1, Rognes et al., 2016), and the merged reads from all libraries are concatenated into a single FASTQ file. For accurate ASV generation, it is imperative that the merged reads be the same length (Edgar, 2016b). To ensure this, merged reads are globally trimmed to a user-specified maximum read length using fastp. Merged reads of the set length are then extracted from the total merged read file using bbduk.sh and dereplicated using VSEARCH, producing a file of unique reads that retains read representation information. ASVs are then generated from this unique read file with VSEARCH and the UNOISE3 algorithm (Edgar, 2016b; Rognes et al., 2016). Chimeric ASVs are detected and removed using VSEARCH and the UCHIME3 algorithm (Edgar, 2016a).
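The length-normalization and dereplication steps described above can be illustrated with a minimal Python sketch. It is not the vAMPirus implementation, which relies on fastp, bbduk.sh, and VSEARCH; the function names, the toy reads, and the `uniq` label prefix are hypothetical. The sketch only shows the idea: trim every merged read to one fixed length, discard reads that are shorter, and collapse identical reads while recording how many times each occurred. The `;size=N` suffix imitates the abundance annotation that VSEARCH can write to sequence headers.

```python
from collections import Counter


def trim_and_filter(reads, length):
    """Trim each read to `length` bases and drop reads shorter than `length`."""
    return [read[:length] for read in reads if len(read) >= length]


def dereplicate(reads):
    """Collapse identical reads into (sequence, count) pairs.

    Pairs are sorted by decreasing count, with ties broken alphabetically.
    """
    counts = Counter(reads)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def to_fasta(uniques, prefix="uniq"):
    """Format unique reads as FASTA text with a ';size=N' abundance suffix."""
    lines = []
    for index, (seq, count) in enumerate(uniques, start=1):
        lines.append(f">{prefix}{index};size={count}")
        lines.append(seq)
    return "\n".join(lines)


# Toy merged reads of unequal length (hypothetical data).
merged = ["ACGTACGTAA", "ACGTACGT", "ACGTACGTCC", "TTGTACGT", "ACGTAC"]

fixed = trim_and_filter(merged, 8)   # every retained read is now 8 bases long
uniques = dereplicate(fixed)         # identical reads collapsed with counts
print(to_fasta(uniques))
```

In this toy input, three reads become identical only after trimming to 8 bases, which is why a common length is enforced before dereplication: reads that differ solely in where they end would otherwise be counted as separate unique sequences.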
Prior to entering downstream pipelines, vAMPirus gives users the option to filter ASVs with DIAMOND blastx (v2.0.15, Buchfink et al., 2015), either to remove non-target sequences or to focus their analyses on a subset of ASVs/aminotypes. These steps produce a final ASV FASTA file that is then used as input for the DataCheck and Analyze pipelines.
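The optional filtering step can be pictured as a simple set operation on ASV identifiers. The Python sketch below is an assumption-laden illustration, not the logic vAMPirus uses internally: it assumes alignment results are available as tab-separated text with the query (ASV) identifier in the first column, as in BLAST-style tabular output, and the function names, ASV labels, and protein labels are hypothetical. Depending on the flag, ASVs with a hit are either kept (focusing on a subset) or dropped (removing non-target sequences).

```python
def query_ids_with_hits(tabular_text):
    """Collect query IDs (first column) from tab-separated alignment output."""
    ids = set()
    for line in tabular_text.splitlines():
        if line.strip():
            ids.add(line.split("\t")[0])
    return ids


def filter_asvs(asvs, hit_ids, keep_hits=True):
    """Keep ASVs with a hit (keep_hits=True) or ASVs without one (keep_hits=False)."""
    return {name: seq for name, seq in asvs.items() if (name in hit_ids) == keep_hits}


# Hypothetical ASVs and hypothetical tabular hits: query, subject, percent identity.
asvs = {"ASV1": "ACGTACGT", "ASV2": "TTGTACGT", "ASV3": "GGGTACGT"}
hits = "ASV1\tprotA\t98.5\nASV3\tprotB\t91.0\n"

hit_ids = query_ids_with_hits(hits)
kept = filter_asvs(asvs, hit_ids, keep_hits=True)       # ASVs matching the database
removed = filter_asvs(asvs, hit_ids, keep_hits=False)   # ASVs with no match
print(sorted(kept), sorted(removed))
```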