Abstract
Amplicon sequencing is an effective and increasingly applied method for
studying viral communities in the environment. Here, we present
vAMPirus, a user-friendly, comprehensive, and versatile DNA and RNA
virus amplicon sequence analysis program, designed to support
investigators in exploring virus amplicon sequencing data and running
informed, reproducible analyses. vAMPirus takes raw virus amplicon
libraries as input and, by default, performs nucleotide- and protein-based
analyses to produce results such as sequence abundance information,
taxonomic classifications, phylogenies, and community diversity metrics.
The vAMPirus pipelines additionally include optional approaches that can
increase the biological signal-to-noise ratio in results by leveraging
tools not yet commonly applied to virus amplicon data analyses. In this
paper, we validate the vAMPirus analytical framework and illustrate its
integration into the general virus amplicon sequencing workflow by
recapitulating findings from two previously published double-stranded
DNA virus datasets. As a case study, we also apply the program to
explore the diversity and distribution of a coral reef-associated RNA
virus. vAMPirus is integrated with the Nextflow workflow manager,
offering straightforward scalability, standardization, and communication
of virus lineage-specific analyses. The vAMPirus framework itself is
also designed to be adaptable; community-driven analytical standards
will continue to be incorporated as the field advances. vAMPirus
supports researchers in revealing patterns of virus diversity and
population dynamics in nature, while promoting study reproducibility and
comparability.
Introduction
From the human gut to sediments in the deep ocean, viruses are abundant
and diverse, and they shape the systems they inhabit (Breitbart et al., 2018;
Correa et al., 2021; Suttle, 2007). The advent of high-throughput
sequencing (HTS) techniques like amplicon sequencing has transformed the
field of virology, illuminating the currently unculturable virosphere
(Labadie et al., 2020; Metcalf et al., 1995; Paez-Espino et al., 2017;
Zayed et al., 2022) and helping identify the impacts of viruses on
ecosystem and host function (Braga et al., 2020; Breitbart et al., 2018;
Thurber et al., 2017; Uyaguari-Diaz et al., 2016). Amplicon sequencing
is a targeted, polymerase chain reaction (PCR)-based HTS approach that
allows deep characterization of genetic variants within virus
populations (Short et al., 2010). The targeted nature of amplicon
sequencing reduces the economic and computational investment required
for spatiotemporal investigations of virus communities at ecologically
relevant scales (see Finke & Suttle, 2019; Frantzen & Holo, 2019;
Grupstra et al., 2022; Gustavsen & Suttle, 2021; Howe-Kerr et al.,
2022; Montalvo-Proaño et al., 2017). The number of studies leveraging
virus amplicon sequencing has increased rapidly over the last two
decades (e.g., 16 peer-reviewed publications in 1998 compared to 127 in
2021 based on a Web of Science search of ‘virus amplicon sequencing’,
November 2022).
The general virus amplicon sequencing workflow includes (1) extraction
of virus nucleic acid (DNA or RNA), (2) PCR amplification of a virus marker
gene or transcript, (3) HTS of virus marker gene amplicons, and (4)
bioinformatic analysis of sequencing data (Short et al., 2010; Figure
1). The effective analysis and interpretation of amplicon sequencing
data rely on biologically accurate binning of marker gene sequences
into taxonomically or ecologically distinct units. Recognizing viral
taxa or ecotypes, however, can be challenging. For example, non-model
viruses have limited baseline information available to inform the
selection of clustering thresholds. Other viruses, such as RNA viruses,
have error-prone polymerases and produce quasispecies, a population
structure consisting of large numbers of variant genomes (Domingo &
Perales, 2019) that may not be easily resolved by the same clustering
percentage. Amplicon sequence variants (ASVs) are a promising
non-clustering-based approach for virus amplicon analyses that offers
high precision and biological accuracy as error-derived sequence
variants are removed during ASV generation (Callahan et al., 2017;
Edgar, 2016b). In addition, the identity of an ASV is not specific to a
given dataset, unlike that of de novo operational taxonomic units
(OTUs), which are defined by clustering marker gene sequences at a
chosen percent identity (Callahan et al., 2017). ASVs and their unique
translations (‘aminotypes’, see Grupstra et al., 2022) can therefore be
compared directly among studies (Callahan et al., 2017).
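The relationship between ASVs and aminotypes can be illustrated with a minimal Python sketch that translates a few made-up ASV sequences and groups those encoding the same protein sequence. This is not the vAMPirus implementation. The function names (`translate`, `group_into_aminotypes`) and the example sequences are hypothetical, and the sketch assumes that every ASV is already in reading frame 1 and uses the standard genetic code.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides: str) -> str:
    """Translate a DNA sequence in frame 1; unknown codons become 'X'."""
    seq = nucleotides.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


def group_into_aminotypes(asvs: dict[str, str]) -> dict[str, list[str]]:
    """Map each unique translation (aminotype) to the ASV IDs that encode it."""
    aminotypes: dict[str, list[str]] = defaultdict(list)
    for asv_id, sequence in asvs.items():
        aminotypes[translate(sequence)].append(asv_id)
    return dict(aminotypes)


if __name__ == "__main__":
    # Hypothetical ASVs (assumption: in frame, standard genetic code).
    example_asvs = {
        "ASV1": "ATGGCTGAA",
        "ASV2": "ATGGCCGAA",  # synonymous change relative to ASV1
        "ASV3": "ATGGCTGAT",  # nonsynonymous change relative to ASV1
    }
    print(group_into_aminotypes(example_asvs))
```

In this example, ASV1 and ASV2 differ only by a synonymous substitution and collapse into a single aminotype, whereas ASV3 carries a nonsynonymous change and forms its own. Because the grouping depends only on the sequences themselves, the same aminotype would be recovered from any dataset containing these ASVs.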
To promote the standardization, reproducibility and cross-comparison of
DNA and RNA virus amplicon sequence analyses, we developed the automated
bioinformatics tool, vAMPirus (github.com/Aveglia/vAMPirus). vAMPirus
takes raw (unprocessed) virus amplicon libraries as input, performs all read
processing and diversity analysis steps, and produces reports detailing
results (e.g., relative abundance plots, community diversity metrics)
with interactive figures and tables. vAMPirus supports initial
explorations of viral amplicon sequence datasets via a ‘DataCheck’
pipeline, which generates an HTML report with information on data
quality and sequence diversity. Results from the exploratory DataCheck
pipeline can then be used to optimize parameters in the read processing
or ASV generation steps within the vAMPirus ‘Analyze’ pipeline; this can
improve the signal-to-noise ratio in downstream analyses. vAMPirus is
integrated with the Nextflow workflow manager, which uses a
configuration file that can be shared among investigators, facilitating
the standardization and dissemination of virus amplicon sequence
analyses across projects and research groups. To that end, we also
created the vAMPirus Analysis Repository
(https://zenodo.org/communities/vampirusrepo/) to act as a central
location for all published vAMPirus analyses. vAMPirus is intended to be
accessible to researchers with a range of bioinformatics experience
levels, and includes substantial help documentation with step-by-step
instructions for running the tool
(https://github.com/Aveglia/vAMPirus/blob/master/docs/). By
facilitating the standardization of viral lineage-specific analyses and
increasing the signal-to-noise ratio in community diversity analyses,
vAMPirus will enhance the effectiveness of virus amplicon studies and
lead to a more developed understanding of the global virosphere.
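As a concrete example of the kinds of summaries mentioned above (relative abundances and community diversity metrics), the following minimal Python sketch computes relative abundances and the Shannon diversity index from a table of per-ASV read counts. It is an illustration only and does not reproduce the code used inside vAMPirus. The function names and the example counts are hypothetical, and the Shannon index is used here simply as one widely applied diversity metric.

```python
import math


def relative_abundance(counts: dict[str, int]) -> dict[str, float]:
    """Convert raw read counts per sequence variant into proportions."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must contain at least one read")
    return {variant: n / total for variant, n in counts.items()}


def shannon_index(counts: dict[str, int]) -> float:
    """Shannon diversity H' = -sum(p_i * ln(p_i)) over variants with p_i > 0."""
    proportions = relative_abundance(counts).values()
    return -sum(p * math.log(p) for p in proportions if p > 0)


if __name__ == "__main__":
    # Hypothetical read counts for one sample.
    sample_counts = {"ASV1": 50, "ASV2": 30, "ASV3": 20}
    print(relative_abundance(sample_counts))
    print(round(shannon_index(sample_counts), 3))
```

A sample dominated by a single variant yields an index near zero, whereas a sample with evenly distributed variants yields a higher value (for example, ln 2 for two equally abundant variants).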