2. Data processing and classification
A series of preprocessing steps was applied to the audio after collection, beginning with normalizing all survey audio to a maximum gain of -2 dB. The SWIFT recorder firmware writes a high-amplitude audio spike at
the beginning of the first file recorded after the unit wakes from
standby (e.g., the beginning of the 5:00 and 16:00 audio files);
therefore, we chose to overwrite the first five seconds of audio on each
of these files to prevent this spike from impacting the gain
normalization step. As our chosen audio classifier architecture operates
on fixed-length samples, we split each 30 min audio file into 7197 overlapping 2 s audio “windows”, each offset from the previous window by 0.25 s. The classifier operates on the log-Mel-weighted spectrogram (Knight et al., 2017) of each window, which is created dynamically during classification using STFT utilities in the TensorFlow Python module (Table 2) at a native resolution of 512x512 px.
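The following sketch illustrates the windowing and spectrogram computation in Python using TensorFlow's signal utilities; the sample rate, FFT length, and number of Mel bands shown here are placeholders rather than the values listed in Table 2.

import numpy as np
import tensorflow as tf

SAMPLE_RATE = 48000          # assumed recorder sample rate (illustrative)
WINDOW_S, HOP_S = 2.0, 0.25  # 2 s windows advanced by 0.25 s

def split_into_windows(audio: np.ndarray) -> tf.Tensor:
    """Split a 1-D audio array into overlapping fixed-length windows."""
    return tf.signal.frame(audio,
                           frame_length=int(WINDOW_S * SAMPLE_RATE),
                           frame_step=int(HOP_S * SAMPLE_RATE))

def log_mel_spectrogram(window, fft_length=1024, n_mels=256):
    """Compute a log-Mel-weighted spectrogram for a single window."""
    window = tf.cast(window, tf.float32)
    stft = tf.signal.stft(window, frame_length=fft_length,
                          frame_step=fft_length // 4, fft_length=fft_length)
    power = tf.abs(stft) ** 2
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=n_mels,
        num_spectrogram_bins=fft_length // 2 + 1,
        sample_rate=SAMPLE_RATE)
    return tf.math.log(tf.matmul(power, mel_matrix) + 1e-6)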
Audio event detection was conducted using a set of Convolutional Neural
Network classifiers. The chosen classifier architecture is adapted from
the multiclass single-label classifier called “Model 1” in Kahl et al.
(2017). Our decision to use a multiclass single-label classifier
architecture was driven by a desire for reduced learning complexity; however, we feel there is merit in introducing a multilabel classifier in future analyses, as existing machine learning techniques can handle this task with minor modifications (Kahl et al., 2017). For similar
reasons, we reduced the number of neurons per hidden layer by half to
account for limitations in available processing power, and also
down-sampled the 512x512 px spectrogram images to 256x256 px before
training and classification. The full classifier architecture is
described in Table 3. All data processing was performed either in
Python, using a combination of TensorFlow 2.0 (Abadi et al., 2016) and other widely used Python modules, or, in the case of later statistical
testing, in R (R Core Team, 2019). During training, we applied the same
STFT algorithm as used for the survey data to dynamically convert the
training audio to log-Mel-weighted spectrograms, and implemented data
augmentation to improve model generalization (Ding et al., 2016). The augmentation parameters, along with general model hyperparameters (Table 4), were chosen using a Bayesian hyperparameter search module in Python (Nogueira, 2014) that was driven to optimize the calculated multiclass F1-score (Sokolova & Lapalme, 2009; β = 1) on a set of known-good clips (hereafter the “validation set”) created from a sample of clips
not used in the training data (Table 1). F1-score was calculated as a
macro-average of the 12 classes in order to give equal weight to rare
classes. Although the goal of hyperparameter search techniques is
typically to identify an optimal set of parameters, we observed two
apparent local optima that we chose to incorporate into our
classification pipeline as two submodels: (a) submodel 1, which added artificial Gaussian noise to training spectrograms as part of the augmentation process, and (b) submodel 2, which did not. The set of class
probabilities returned for each clip was the mean of the probabilities
reported by the two models (hereafter the “ensemble”). We validated
each submodel, as well as the ensemble, on the same validation set used
in hyperparameter search.
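The search loop can be sketched as follows, assuming the cited module is the bayesian-optimization package (import name bayes_opt); the tuned parameter names, their bounds, and the build_and_train helper, val_spectrograms, and val_labels objects are hypothetical placeholders rather than the actual Table 4 hyperparameters.

from bayes_opt import BayesianOptimization
from sklearn.metrics import f1_score

def objective(learning_rate, dropout, noise_stddev):
    """Train a candidate submodel and score it on the validation set."""
    # build_and_train is a hypothetical helper wrapping model construction,
    # augmentation, and training with the candidate parameter values.
    model = build_and_train(learning_rate=learning_rate,
                            dropout=dropout,
                            noise_stddev=noise_stddev)
    y_pred = model.predict(val_spectrograms).argmax(axis=1)
    # Macro-averaged F1 gives equal weight to rare classes.
    return f1_score(val_labels, y_pred, average="macro")

optimizer = BayesianOptimization(
    f=objective,
    pbounds={"learning_rate": (1e-5, 1e-2),
             "dropout": (0.0, 0.5),
             "noise_stddev": (0.0, 0.2)},
    random_state=42)
optimizer.maximize(init_points=5, n_iter=25)
print(optimizer.max)  # best parameter set found by the search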
When deployed on survey data, our classification pipeline yields
classifications as a sequence of probability vectors of size 12, where
each vector corresponds to one window in the sequence of overlapping
windows. Windows that contain only the very beginning or the very end of a tinamou vocalization are often classified incorrectly from their raw class probabilities, which we believe results from the fact that different tinamou species often share structural similarities with one another in those regions of their vocalizations. To reduce the impact of this
pattern on our overall classification accuracy, we applied a “smoothing” post-processing step to the class probabilities in which each probability value was replaced by the weighted average of that value (weight = 1) and the values immediately before and after it in the time sequence (weight = 0.5 each). Windows with a maximum class probability < 0.85 were removed, and the remainder were assigned the label with the highest class probability. All windows detected as positive were
manually checked for accuracy and relabeled if incorrect.
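A minimal sketch of the smoothing and thresholding step, assuming the class probabilities are held in a NumPy array of shape (n_windows, n_classes); replicating the first and last probability vectors at the array edges is an assumption.

import numpy as np

def smooth_probabilities(probs):
    """Replace each window's probabilities with the weighted average of that
    window (weight = 1) and its immediate neighbours (weight = 0.5 each)."""
    padded = np.pad(probs, ((1, 1), (0, 0)), mode="edge")
    return (0.5 * padded[:-2] + 1.0 * padded[1:-1] + 0.5 * padded[2:]) / 2.0

def label_windows(probs, threshold=0.85):
    """Drop windows whose maximum smoothed probability is below the threshold
    and assign the remainder the label with the highest class probability."""
    smoothed = smooth_probabilities(probs)
    keep = smoothed.max(axis=1) >= threshold
    return np.flatnonzero(keep), smoothed[keep].argmax(axis=1)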
We assessed the degree of marginal improvement in classifier performance
due to increased training dataset size and increased structural
uniformity between training clips and survey audio by running a second
“pass” of the acoustic classifier on the survey data with a set of
models that had been trained using a larger training dataset. To
generate this dataset, the original training dataset was supplemented
with all known-good positive windows from the initial classification (the first “pass”). We sampled from this supplemented dataset to produce a new training dataset (n = 18,480) containing, for each class, up to 2,000 randomly selected clips (4,000 for the “junk” class), or as many clips as were available if fewer (Table 1). We trained new submodels on this data using the
same model architecture and hyperparameters that were used for models in
the first pass. The sole change made to the training process between
classifications 1 and 2 was to alter the batch generation code to
produce batches with balanced class frequencies to offset the greatly
increased degree of class imbalance in the supplemented dataset. Each
submodel was validated using a new validation set that contained
known-good survey audio whenever possible in order to ensure that the
calculated metrics would be more indicative of each submodel’s real-world performance.
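The class-balanced batch generation can be sketched as follows, assuming in-memory arrays of spectrograms and integer class labels; the generator used in practice may differ in detail.

import numpy as np

def balanced_batches(spectrograms, labels, batch_size, seed=None):
    """Yield batches in which every class is drawn with equal probability,
    offsetting the class imbalance of the supplemented training set."""
    rng = np.random.default_rng(seed)
    by_class = {c: np.flatnonzero(labels == c) for c in np.unique(labels)}
    classes = np.array(list(by_class))
    while True:
        picked = rng.choice(classes, size=batch_size)
        idx = np.array([rng.choice(by_class[c]) for c in picked])
        yield spectrograms[idx], labels[idx]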
The survey data was classified with these new models, and the resulting
class predictions were processed to extract probable detections as
described previously. To reduce manual review time, all positive windows from the initial classification were “grandfathered in” as correctly identified, since they had already been manually checked, which allowed us to check only the positive detections newly identified during the second pass. Finally, each sequence of windows with a particular species classification that was separated from any other such sequence by ≥ 0.75 s was grouped as a single vocal event.
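A sketch of this grouping step, assuming gaps are measured between window start times (window hop = 0.25 s, window length = 2 s) and that each detection is a (window index, species label) pair:

from collections import defaultdict

def group_vocal_events(detections, hop_s=0.25, gap_s=0.75, window_s=2.0):
    """Merge runs of same-species windows whose start times are less than
    gap_s apart; each merged run becomes one vocal event."""
    by_species = defaultdict(list)
    for idx, species in detections:
        by_species[species].append(idx * hop_s)

    events = []                                # (species, start_s, end_s)
    for species, starts in by_species.items():
        starts.sort()
        run_start = prev = starts[0]
        for s in starts[1:]:
            if s - prev >= gap_s:              # gap closes the current event
                events.append((species, run_start, prev + window_s))
                run_start = s
            prev = s
        events.append((species, run_start, prev + window_s))
    return sorted(events, key=lambda e: e[1])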
For the purposes of quantifying model performance and generalizability,
we calculated precision, recall, F1-score, and precision-recall area under the curve (AUC) metrics for the primary and secondary models, presented on a per-class basis or as macro-averages across classes, after Sokolova & Lapalme (2009). All metrics were calculated from classifier performance on the corresponding validation set of known-good clips, which drew on survey audio whenever possible so that the performance metrics would be more indicative of each submodel’s real-world performance.
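These metrics can be computed as sketched below with scikit-learn (not necessarily the implementation used here); average precision stands in for the precision-recall AUC.

import numpy as np
from sklearn.metrics import precision_recall_fscore_support, average_precision_score

def summarize_performance(y_true, y_prob, n_classes=12):
    """y_true: integer labels; y_prob: (n_samples, n_classes) probabilities."""
    y_true = np.asarray(y_true)
    y_pred = y_prob.argmax(axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=list(range(n_classes)), zero_division=0)
    # One-vs-rest precision-recall AUC (average precision) per class.
    pr_auc = np.array([average_precision_score((y_true == c).astype(int),
                                               y_prob[:, c])
                       for c in range(n_classes)])
    macro = {"precision": precision.mean(), "recall": recall.mean(),
             "f1": f1.mean(), "pr_auc": pr_auc.mean()}
    return {"per_class": (precision, recall, f1, pr_auc), "macro": macro}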
As a point of comparison for our audio detection counts, we also
examined community science observation data for tinamous from eBird
(Sullivan et al., 2009; Sullivan et al., 2015). We used stationary and
traveling checklists containing tinamous that were submitted at the LACC
hotspot between the months of July and October, removing stationary
checklists with durations > 150 min and traveling
checklists with lengths > 0.5 km in order to constrain the
sampling effort parameter space of the eBird data such that it was more
comparable to our 2.5 h morning and afternoon recording periods. Despite
these filtering steps, the final eBird dataset still contained all
locally occurring tinamou species. However, it was clear that our
acoustic data density for C. strigulosus vastly outstripped eBird
data density, so we excluded this species from our analysis as we feel
it warrants separate discussion. We produced estimates of occurrence probability by averaging the results of 1,000 random samples from the eBird data and of an equal number of samples from the acoustic event dataset, drawn using the same underlying sampling effort density distribution as the eBird checklist durations. Audio frequency estimates were calculated separately for terra firme and floodplain habitat types from site-level presence-absence frequencies and then averaged. In addition, we compared our audio detection counts
to camera trap capture rates reported by Mere Roncal et al. (2019), also
at LACC. Camera trap capture rates suggest seasonally driven differences in tinamou activity rates, so we considered only detection rates from
the dry season, which limited our comparison to the five tinamou species
reported by Mere Roncal et al. (2019) for which dry season camera trap
data is available. Occurrence frequencies were again calculated as the
average of the distributions from terra firme and floodplain sites.
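A rough sketch of the resampling step, with each record reduced to a detection (1) or non-detection (0) for a given species per checklist or recording period; bootstrap-style sampling with replacement and the omission of the effort-matching weights are simplifying assumptions, and all names are illustrative.

import numpy as np

def mean_detection_frequency(detected, n_samples=1000, sample_size=None, seed=None):
    """Average detection frequency over repeated random samples."""
    rng = np.random.default_rng(seed)
    detected = np.asarray(detected)
    k = sample_size or len(detected)
    freqs = [rng.choice(detected, size=k, replace=True).mean()
             for _ in range(n_samples)]
    return float(np.mean(freqs))

# Habitat-stratified estimate: mean of the terra firme and floodplain values,
# e.g. 0.5 * (mean_detection_frequency(tf_detections)
#             + mean_detection_frequency(fp_detections))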