Introduction
Environmental DNA (eDNA) is providing previously unthinkable insights into the aquatic environment as a non-invasive and relatively cost-efficient tool. It can reveal the presence or absence and the distribution of particular species, as well as the community composition of an ecosystem, and it is particularly informative for evaluating changes in these parameters under variable conditions (Furlan et al., 2019; Gallego et al., 2020; Stat et al., 2017; Thomsen & Willerslev, 2015). Beyond the logistical and technical constraints of acquiring samples, accurately characterizing the biome poses further challenges, including the molecular strategy used (i.e., DNA extraction and the marker or gene of choice) and the reference databases used to identify the origin of the DNA found in a given environment (Jackman et al., 2021; Schenekar et al., 2020; Wang et al., 2021). Environmental DNA identification needs robust, comprehensive, and accurate DNA reference libraries based on solid taxonomic frameworks, and building them requires a more exhaustive and updated effort in light of recent technical advancements in DNA sequencing (Margaryan et al., 2021; Novak et al., 2020; Taberlet et al., 2012).
There are currently two main strategies for biomonitoring surveys that describe community composition or evaluate the abundance of particular species: DNA metabarcoding approaches and targeted molecular assays using quantitative polymerase chain reaction (qPCR) or digital PCR (dPCR) (Shu et al., 2020). These molecular tools are well established and have been added to the toolbox of conservation management. Both rely on public genomic databases and, more specifically, on mitochondrial genes found in public repositories to construct well-represented alignments, either to identify amplicon sequence variants (ASVs) in the case of metabarcoding or to achieve the desired specificity for primer design in the case of targeted assays. However, in spite of the colossal sequencing effort undertaken in the last two decades with initiatives like the Barcode of Life (Mugnai et al., 2021; Ratnasingham & Hebert, 2007), mitochondrial genome reference data for the breadth of taxa of interest are not yet accessible, and where they are, only one or a few genes may be available. Among the data that do exist, suitability for designing species-specific oligonucleotides for targeted eDNA assays is typically suboptimal. Such is the case for one of the most traditionally sequenced genes for barcoding purposes, the mitochondrial cytochrome c oxidase subunit 1 (COI), which offers only scattered, short conserved regions that are unsuitable for designing primers that effectively discriminate the taxon of interest (Langlois et al., 2020; Margaryan et al., 2021; Schroeter et al., 2020).
Metabarcoding analysis of fish targets more conserved genes or regions where universal primers can be placed flanking interspecifically variable regions. Examples are the 12S rRNA subunit (12S), for which the MiFish primers (Miya et al., 2015) produce an amplicon of ca. 170 base pairs (bp), and the 16S rRNA (16S), for which the Ac16S primers amplify a region of 330 bp (Evans et al., 2016) and the Fish16S primers amplify a fragment of ca. 100 bp (Deagle et al., 2009; Shaw et al., 2016). These markers have provided an extraordinary wealth of information for community studies and species detection using eDNA (Miya, 2022; Shu et al., 2020). Building more comprehensive mitochondrial genome databases would be particularly advantageous for species whose identification cannot be resolved with short fragments of the 16S, 12S, or COI genes, and would expand representation to account for intra- and interspecific variability. In addition, having more regions of the mitogenome available would facilitate eDNA multi-marker metabarcoding approaches and eDNA population genetics studies, and would even make it possible to explore mitochondria-associated disorders caused by mutations, an unexplored line of research that may provide insights into health or fitness questions (Brown, 2008; Dugal et al., 2021; Jackman et al., 2021; Jensen et al., 2020; Sharma & Sampath, 2019). Although mitochondrial genes are physically linked to each other, mitochondrial haplotypes from eDNA could determine the minimum number of individuals and provide information about the origin of populations (e.g., in the case of stocks of anadromous fishes found in the ocean; Weitemier et al., 2021). Additionally, whole mitochondrial genome databases built from verified, vouchered specimens will be critical for seafood monitoring, as molecular methods are routinely used to identify species in domestic and international trade (Bourret et al., 2020; Ogden, 2008). The expansion of mitochondrial genomic databases is not only costly and dependent on access to voucher specimens, but also relies on so-called universal primers that have proven to be less generic than desired. Whole mitochondrial sequencing, although attainable, is still expensive and not readily accessible to all research groups because of limitations in read length, costly methods, and long protocols (Gilpatrick et al., 2020).
In this study, we explore target enrichment methods to attain whole mitogenome sequence data in a simple and cost-effective manner. Current technologies allow for direct whole genome sequencing (i.e., no special treatments are necessary) that can yield regions of interest, particularly those in high copy number such as mitochondrial DNA, which can be identified and recovered from the data in a process called ‘genome skimming’ or shallow sequencing (Straub et al., 2012). We propose that targeted mitochondrial DNA enrichment during the DNA isolation or library preparation step should be pursued in order to reduce costs, time, and data storage and bioinformatic requirements, while improving coverage and consensus sequences and avoiding pseudogenes and general background noise, which can affect the molarity of the target and thus compromise sequencing performance. Although a fish cell contains hundreds to about a thousand copies of mitochondrial DNA (mtDNA) depending on the tissue and age, which is lower than in mammals (Hartmann et al., 2011), mtDNA normally makes up only around 0.1% of the genomic DNA in a preparation (Robin & Wong, 1988) and is overwhelmed by nuclear DNA (nDNA). Different enrichment outcomes can be attained depending on the DNA extraction, treatment, sequencing and bioinformatic approaches employed. The physical properties of mtDNA (its location within an enclosed organelle and the circularity of the mitochondrial genome) can be used to preferentially extract mtDNA using sequential precipitation methods or to deplete non-circular DNA (i.e., nDNA) using exonucleases. Targeted enrichment can also be conducted with CRISPR-based methods by targeting conserved regions of the mitogenome with specific guide RNAs (Schultzhaus et al., 2021). Mitochondrial enrichment without PCR amplification avoids universal primer incompatibility and PCR amplification errors. Additionally, long-range amplification is proving challenging (Ramón-Laca, personal observations; Gilpatrick et al., 2020), and target enrichment using hybridization capture is not yet fully operative for long fragments.
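To make the cost argument concrete, the arithmetic behind the ca. 0.1% figure can be sketched as follows. This is only an illustrative calculation, not part of the study's workflow: the run yield (1 Gb), the mitogenome length (16,500 bp, a typical order of magnitude for fish), the 10% post-enrichment fraction, and the function names are all assumptions introduced here, and the sketch further assumes that the share of sequenced bases that are mitochondrial equals the mtDNA share of the preparation.

```python
def expected_mito_depth(total_bases, mito_fraction, mito_length):
    """Mean depth over the mitogenome when `mito_fraction` of all
    sequenced bases originate from mtDNA."""
    return total_bases * mito_fraction / mito_length


def bases_needed(target_depth, mito_fraction, mito_length):
    """Total sequencing yield required to reach `target_depth` on the mitogenome."""
    return target_depth * mito_length / mito_fraction


TOTAL_BASES = 1_000_000_000   # hypothetical run yield (1 Gb)
MITO_LENGTH = 16_500          # assumed mitogenome length in bp
UNENRICHED = 0.001            # ca. 0.1% mtDNA, the figure cited in the text
ENRICHED = 0.10               # hypothetical fraction after enrichment

depth_unenriched = expected_mito_depth(TOTAL_BASES, UNENRICHED, MITO_LENGTH)
depth_enriched = expected_mito_depth(TOTAL_BASES, ENRICHED, MITO_LENGTH)

print(f"Unenriched depth: {depth_unenriched:.1f}x")
print(f"Enriched depth:   {depth_enriched:.1f}x")
print(f"Yield for 100x, unenriched: {bases_needed(100, UNENRICHED, MITO_LENGTH):.3g} bases")
print(f"Yield for 100x, enriched:   {bases_needed(100, ENRICHED, MITO_LENGTH):.3g} bases")
```

Under these assumptions, raising the mitochondrial fraction by a given factor lowers the sequencing yield required for a fixed depth by the same factor, which is the sense in which enrichment saves sequencing, storage, and computation relative to genome skimming.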
Long fragment sequences can help reduce the number of nuclear mitochondrial sequences (NumtS) that are sequenced by accident; NumtS can be very long and are found in fish in a greater ratio than in most vertebrate species (Antunes & Ramos, 2005; Dayama et al., 2014). Long fragments preserve the order of the genes, in contrast with short-read sequencing platforms, which can also be affected by PCR bias in AT-rich regions (Gan et al., 2019). Long sequences can be key for accurate genome assembly (Pollard et al., 2018), particularly in repetitive regions, which are sometimes found in the mitochondrial control region and have proven challenging to sequence with traditional methods (i.e., Sanger sequencing and whole genome sequencing of short fragments) (Formenti et al., 2021; McDonald et al., 2021). With long reads, a rearrangement in gene order will not be missed when annotations are transferred from a different species, because the gene order is determined by the sequence itself and not by the reference genome. For all these reasons, de novo assembly of whole mitogenomes should be favored, so as not to bias the newly generated mitogenomes or overlook any possible modifications.
However, long-fragment sequencing comes with its own challenges. The main downside, and a common criticism, of long-read sequencing with Oxford Nanopore Technologies (ONT) is the relatively high error rate of the sequences obtained. Nonetheless, in contrast to short-read platforms, these errors are mostly random, except for those in homopolymeric regions of single-pore ONT reads, and can be overcome with high read depth (Pollard et al., 2018; Schultzhaus et al., 2021) and with the constantly evolving flow cells, chemistry and base calling algorithms. Fish collections have traditionally preserved specimens (e.g., whole individuals, fin clips) in jars or tubes with >95% ethanol. This method has worked well for gene sequencing and for microsatellite or SNP typing, but it does not prevent the degradation of high molecular weight fragments (Oosting et al., 2020); thus, most collection specimens will not yield fragments long enough to hinder accidental sequencing of NumtS. In addition, fish samples can sometimes take a long time to be sorted, even on board dedicated research vessels, which can compromise tissue quality and lead to degradation of most of the high molecular weight DNA (Oosting et al., 2020; Rodriguez-Ezpeleta et al., 2013).
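The reason random errors matter less at high read depth can be illustrated with a toy simulation. This is a deliberately simplified sketch and not the base calling or polishing procedure of any sequencing platform: it assumes substitution errors only (no insertions or deletions, no homopolymer-specific bias), reads that are already aligned to the same coordinates, and a hypothetical 10% per-base error rate; all function names are invented for illustration.

```python
import random
from collections import Counter

BASES = "ACGT"


def noisy_read(reference, error_rate, rng):
    """Copy `reference`, replacing each base with a different random base
    with probability `error_rate` (substitutions only)."""
    read = []
    for base in reference:
        if rng.random() < error_rate:
            read.append(rng.choice([b for b in BASES if b != base]))
        else:
            read.append(base)
    return "".join(read)


def majority_consensus(reads):
    """Call the most frequent base at every position of pre-aligned reads."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
reference = "".join(rng.choice(BASES) for _ in range(200))
reads = [noisy_read(reference, 0.10, rng) for _ in range(50)]  # depth of 50

consensus = majority_consensus(reads)
single_read_errors = sum(a != b for a, b in zip(reads[0], reference))
consensus_errors = sum(a != b for a, b in zip(consensus, reference))

print(f"Errors in one read:     {single_read_errors} / {len(reference)}")
print(f"Errors in consensus:    {consensus_errors} / {len(reference)}")
```

Because each random error lands on a different base in different reads, the correct base remains the clear majority at every position. Systematic errors, such as those in homopolymeric regions, recur at the same position across reads and would not be removed by simple voting.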
In this study, we show how to generate whole mitogenomes from fish species, with the aim of building affordable and comprehensive databases that are not restricted to a few genes of interest. The long-fragment approach combined with mitochondrial DNA enrichment produced whole mitogenomes with full coverage and great sequencing depth. Two approaches that enrich mitochondrial DNA were evaluated: 1) mitochondrial DNA enrichment by isolating intact mitochondria; and 2) targeted mitosequencing using CRISPR-Cas9 on conserved regions of the mitochondrial genome. Both approaches are followed by sequencing on an Oxford Nanopore platform. These target enrichment and long-fragment sequencing approaches efficiently produce data for whole mitogenomes while using fewer computational and sequencing resources than genome skimming, making the discovery of mitogenomes of non-model or understudied fish taxa accessible to a broad range of laboratories worldwide.