Step 2: Selecting appropriate gene region(s) and creating
a reference database
The standard DNA barcode for plants is the combination of rbcL and matK (CBOL
Plant Working Group et al., 2009), but other gene regions such as trnL, ITS1, and ITS2 are often used, especially for DNA
metabarcoding, as they have shown finer taxonomic resolution for several
groups (Wilson et al., 2021). Increasing the number and length of gene
regions is expected to improve taxonomic resolution, and some studies on
pollen mixtures have used whole plastid genomes (Lang et al., 2019) or
whole genomes (Bell, Petit, et al., 2021). The barcode gene region(s)
must be suitable for differentiating the taxa in the study system to the
taxonomic level required and have primers universal enough to amplify
all taxa. Recent research found that a combination of rbcL and
ITS2 detected more species, at a finer level of taxonomic
discrimination, than other marker combinations
(Jones, Twyford, et al., 2021). Several studies have worked towards
developing more universal primers for ITS2 (e.g., Kolter &
Gemeinholzer, 2021; Moorhouse-Gann et al., 2018) and primers to amplify
shorter regions to account for pollen degradation in historical samples
(Simanonok et al., 2021). The gain in taxonomic resolution from additional
gene regions (or partial or whole genomes) comes at the cost of more
work in assembling the reference library. Using multiple gene regions
has the additional challenge of determining the best method of combining
results from different markers, with differing taxonomic resolution, for
downstream analysis.
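To make the combination problem concrete, the following minimal Python sketch shows one simple, conservative rule for merging assignments from two markers. It is an illustration only, not a method taken from the studies cited above: the function name `combine_assignments`, the three ranks in `RANKS`, and the example taxa are all assumptions chosen for the example. Each marker is assumed to report a lineage truncated at the deepest rank it could resolve, with unresolved ranks given as `None`.

```python
from typing import List, Optional, Sequence

# Illustrative ranks only; a real pipeline would likely carry more levels.
RANKS = ("family", "genus", "species")


def combine_assignments(
    a: Sequence[Optional[str]], b: Sequence[Optional[str]]
) -> List[Optional[str]]:
    """Merge two per-marker lineages rank by rank (hypothetical rule).

    - If both markers agree at a rank, keep that name.
    - If only one marker resolves a rank, trust that marker.
    - If the markers conflict, stop and keep only the ranks agreed so far.
    Both inputs are assumed to have one entry per rank in RANKS.
    """
    merged: List[Optional[str]] = []
    for name_a, name_b in zip(a, b):
        if name_a is None and name_b is None:
            break  # neither marker resolves this rank or anything below it
        if name_a is None or name_b is None:
            merged.append(name_a or name_b)  # only one marker is informative
        elif name_a == name_b:
            merged.append(name_a)
        else:
            break  # conflict: fall back to the last rank both agree on
    # Pad with None so the result always has one entry per rank.
    merged += [None] * (len(RANKS) - len(merged))
    return merged


# Example: one marker stops at genus, the other reaches species.
rbcl_hit = ("Rosaceae", "Rubus", None)
its2_hit = ("Rosaceae", "Rubus", "Rubus idaeus")
print(dict(zip(RANKS, combine_assignments(rbcl_hit, its2_hit))))
```

Under this rule a conflict at genus level collapses the assignment to family, which trades resolution for caution. Other choices, such as weighting markers by their known performance for a given group, are equally defensible and would need to be validated against the study system.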
It is essential that the chosen gene regions have a comprehensive reference
library covering the species in the study system. Using a custom, relevant
reference library reduces misidentifications and increases the accuracy
of taxonomic assignments (Arstingstall et al., 2021; A. Keller et al.,
2020). Software such as BCdatabaser (A. Keller et al., 2020) for
creating custom databases from species lists can be helpful where there
is no national database, different gene regions are being used, or a
more local database is desired. In addition, the software
MetaCurator (Richardson, Sponsler, McMinn‐Sauder, & Johnson, 2020)
offers two useful features for curating existing or newly generated reference
datasets: 1) identifying the exact amplicon of interest and trimming
away extraneous sequence to avoid non-overlapping amplicons of the same
gene, and 2) dereplicating sequences by taxonomy so that barcodes are
retained for multiple species even when there is no barcode gap (i.e., a
higher and non-overlapping range of sequence divergence between species
than within species). A limitation of this and many other methods is
that they create a static database that needs to be updated frequently
as new sequences become available.
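The idea of dereplicating by taxonomy can be sketched in a few lines of Python. This is a minimal illustration of the concept, not MetaCurator's actual implementation: the function name `dereplicate_by_taxonomy`, the record layout, and the example identifiers and sequences are all hypothetical. The key point is that a sequence is treated as a duplicate only if the same taxon already has that exact sequence, so identical barcodes shared by two species are both retained.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (sequence id, taxon name, sequence).
Record = Tuple[str, str, str]


def dereplicate_by_taxonomy(records: Iterable[Record]) -> List[Record]:
    """Keep the first record for each unique (taxon, sequence) pair.

    Plain dereplication would key on the sequence alone and so would keep
    only one species when two species share an identical barcode.
    """
    seen = set()
    kept: List[Record] = []
    for seq_id, taxon, sequence in records:
        key = (taxon, sequence.upper())  # compare case-insensitively
        if key not in seen:
            seen.add(key)
            kept.append((seq_id, taxon, sequence))
    return kept


# Made-up example records for illustration only.
records = [
    ("r1", "Species A", "ACGTAC"),
    ("r2", "Species A", "ACGTAC"),  # duplicate within Species A: dropped
    ("r3", "Species B", "ACGTAC"),  # same sequence, different taxon: kept
    ("r4", "Species B", "ACGTTC"),  # distinct sequence: kept
]
kept_ids = [seq_id for seq_id, _, _ in dereplicate_by_taxonomy(records)]
print(kept_ids)  # ['r1', 'r3', 'r4']
```

Because the output is a fixed snapshot of the input records, a step like this would need to be rerun whenever new reference sequences are added.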