2.2.4 vAMPirus DataCheck pipeline and report
The vAMPirus DataCheck pipeline can help investigators determine the optimal parameters for read processing, ASV generation, and other downstream analyses conducted in the Analyze pipeline. The DataCheck pipeline is particularly beneficial for investigators working on nascent virus systems because it facilitates the informed establishment of gene-, lineage-, or system-specific analysis standards. The pipeline produces an HTML report that displays information such as sequencing success per sample, read characteristics (e.g., read length, GC content), and ASV/aminotype sequence properties. The DataCheck pipeline also provides insight into the ASV sequences by clustering them across a range of nucleotide and amino acid similarities and plotting the resultant number of cASVs per similarity value. Briefly, nucleotide-based de novo cASVs (ncASVs) are produced by clustering ASV sequences with VSEARCH at 24 different percent identity values (55%, 65%, 75%, and each integer value from 80% to 100%). To generate de novo pcASVs, ASVs are first translated using the program VirtualRibosome (v2.0, Wernersson, 2006), then clustered at the same 24 percent identities with the program CD-HIT (v4.8.1, Fu et al., 2012; Li & Godzik, 2006). For each percent identity value, the number of ncASVs and pcASVs is quantified and visualized as a scatter plot in the DataCheck report. Plotting cluster counts against percent identity in this way is a common approach for choosing a clustering percentage (e.g., Gustavsen & Suttle, 2021): the percent similarity at which the number of cASVs no longer rises linearly (the inflection point) is selected for sequence clustering. Optionally, users can also apply the program oligotyping (Eren et al., 2015) to calculate Shannon entropy values per sequence position for both ASVs and aminotypes, which are then displayed in the report. An example vAMPirus DataCheck report is available at github.com/Aveglia/vAMPirusExamples.
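To make two of the quantities described above concrete, the following Python sketch lists the 24 percent identity values and computes a per-position Shannon entropy for a set of aligned sequences. It is an illustration only and is not the vAMPirus or oligotyping implementation: the function names `clustering_identities` and `positional_entropy` are hypothetical, the use of a base-2 logarithm is an assumption, and the input sequences are assumed to be already aligned to equal length.

```python
from collections import Counter
from math import log2


def clustering_identities():
    """Return the 24 percent identity values described in the text:
    55, 65, 75, and every integer from 80 to 100 inclusive."""
    return [55, 65, 75] + list(range(80, 101))


def positional_entropy(aligned_seqs):
    """Shannon entropy, in bits, for each column of an alignment.

    Assumes all sequences have the same length (i.e., are pre-aligned).
    A column with a single residue has entropy 0; more mixed columns
    score higher.
    """
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")

    entropies = []
    for column in zip(*aligned_seqs):  # iterate over alignment columns
        total = len(column)
        counts = Counter(column)
        # H = sum over residues of p * log2(1 / p)
        h = sum((n / total) * log2(total / n) for n in counts.values())
        entropies.append(h)
    return entropies


if __name__ == "__main__":
    print(len(clustering_identities()))  # number of identity thresholds
    toy_alignment = ["ACGT", "ACGA", "ATGA", "ATGC"]  # made-up example data
    print(positional_entropy(toy_alignment))
```

In this toy alignment the first and third columns are invariant and so carry zero entropy, while the second and fourth columns are variable. Positions with high entropy are the ones that distinguish sequence variants, which is why a per-position entropy profile is informative when judging how aggressively to cluster.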