Table 1: Bitrate, percentage file size reduction, and maximum encodable
frequency for the experimental compression levels.
Quantification of Soundscapes Using Indices
Analytical Indices
We used the seewave (Sueur, Aubin and Simonis, 2008) and soundecology (Villanueva-Rivera and Pijanowski, 2016) packages in
R (ver. 3.6.1; R Core Team, 2020) to extract 7 Analytical Indices (Fig.
4d): Acoustic Complexity Index (ACI), Acoustic Diversity Index (ADI),
Acoustic Evenness (AEve), Bioacoustic Index (Bio), Acoustic Entropy (H),
Median of the Amplitude Envelope (M), and Normalised Difference
Soundscape Index (NDSI) (Supplementary 3). These have been shown to
capture diel phases, seasonality, and habitat type (Bradfer-Lawrence et al., 2019). These indices could not be calculated for every
recording due to file reading errors, but this fault occurred in only
0.3% of all recordings (Supplementary 2b).
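To illustrate what one of these indices measures, here is a minimal Python sketch of M (Median of the Amplitude Envelope); this is an assumed simplification, and seewave's own implementation may differ in windowing and normalisation:

```python
import numpy as np
from scipy.signal import hilbert

def median_amplitude_envelope(x):
    """M: median of the amplitude envelope of a waveform.

    The envelope is taken here as the modulus of the analytic signal
    (Hilbert transform); seewave may smooth or scale it differently.
    """
    envelope = np.abs(hilbert(np.asarray(x, dtype=float)))
    return float(np.median(envelope))
```

For a constant-amplitude sine wave the envelope is flat, so M simply recovers the amplitude.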
AudioSet Fingerprint
The audio was downsampled to 16 kHz, converted to a log-scaled
Mel-frequency spectrogram, and then passed through the “VGG-ish”
Convolutional Neural Network (CNN) trained on the AudioSet database
(Gemmeke et al., 2017; Hershey et al., 2017) (Fig. 1d). This generates a
128-dimensional embedding, the 128 values of which describe the
soundscape of a given recording in an abstracted form, or fingerprint.
As with the Analytical Indices, some recordings could not be
analysed by the AudioSet CNN, but this affected only 0.2% of
recordings (Supplementary 2b).
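The spectrogram preprocessing can be sketched in plain NumPy. The parameter values below (25 ms Hann windows, 10 ms hop, 64 mel bands spanning 125–7500 Hz) follow the published VGGish configuration, but this hand-rolled filterbank is only an approximation of the actual pipeline:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(x, sr=16000, n_fft=400, hop=160, n_mels=64,
                        fmin=125.0, fmax=7500.0):
    """Log-scaled mel spectrogram of a mono waveform sampled at `sr`.

    A rough, VGGish-style configuration: 25 ms Hann windows, 10 ms hop,
    64 mel bands. Returns an array of shape (n_frames, n_mels).
    """
    # Frame the signal with a Hann window
    n_frames = 1 + (len(x) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Magnitude spectrum of each frame
    spec = np.abs(np.fft.rfft(frames, axis=1))
    # Triangular mel filterbank (crude integer binning; adjacent filters
    # may collapse at low frequencies with this simple scheme)
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, spec.shape[1]))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fb[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    mel = spec @ fb.T
    return np.log(mel + 1e-6)   # small offset avoids log(0)
```

One second of 16 kHz audio yields 98 frames of 64 log-mel values, which the CNN then consumes as an image-like input.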
Data Analysis
Impact of Index Selection: Auto-Correlation
Analytical Indices often summarise similar features of a soundscape
(e.g. dominant frequency and frequency bin occupancy): this overlap may
reduce the descriptive scope of the ensemble. We compare the degree of
pairwise correlation between the individual Analytical Indices and
between the individual features of the AudioSet Fingerprint. We also
compare how well each index/feature correlates with the maximum
recordable frequency (Fig. 1e).
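The redundancy of an index set can be summarised as the mean absolute pairwise Spearman correlation between its columns. A minimal Python sketch (the study itself works in R; the function name is our own):

```python
import numpy as np
from scipy.stats import spearmanr

def mean_abs_pairwise_spearman(X):
    """Mean of |rho| over all column pairs of X.

    Rows = recordings, columns = index values (e.g. the 7 Analytical
    Indices, or the 128 AudioSet Fingerprint dimensions).
    """
    rho, _ = spearmanr(X)                    # full correlation matrix
    rho = np.atleast_2d(np.asarray(rho))
    upper = np.triu_indices_from(rho, k=1)   # each pair counted once
    return float(np.mean(np.abs(rho[upper])))
```

Monotonically related columns give a value of 1; independent noise gives a value near 0.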
Impact of Compression: Like-for-Like Differences
We use an adaptation of Bland-Altman plots (Vesna, 2009; Araya-Salas,
Smith-Vidaurre and Webster, 2019) to visualise the scaled difference
(\(D\)) between raw (\(I_{\text{raw}}\)) and compressed
(\(I_{\text{com}}\)) index values, as a percentage of the range of raw
values \(R_{\text{raw}}\) (Fig. 1f):
\begin{equation}
D=\frac{I_{\text{com}}-I_{\text{raw}}}{R_{\text{raw}}}\times 100\nonumber
\end{equation}

\(D\) was not normally distributed (Supplementary 5a), so medians and
inter-quartile ranges are reported. We consider an index to have been
altered by compression when: i) the interquartile range of \(D\) does
not include zero difference, or ii) the median \(D\) is more than
±5% of \(R_{\text{raw}}\). We use Spearman rank correlation to test for
a consistent trend in \(D\) with increasing compression. To reflect
their common use cases, \(D\) for the Analytical Indices is calculated
from the univariate values, while for the AudioSet Fingerprint – which
is intended as a multidimensional metric – \(D\) is calculated
separately for each dimension and then averaged.
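The definition above can be written out directly; a small Python sketch (variable names are our own) of \(D\) and the two alteration criteria:

```python
import numpy as np

def scaled_difference(i_raw, i_com):
    """D: difference between compressed and raw index values, expressed
    as a percentage of the range of the raw values."""
    i_raw = np.asarray(i_raw, dtype=float)
    i_com = np.asarray(i_com, dtype=float)
    r_raw = i_raw.max() - i_raw.min()   # R_raw: range of raw values
    return (i_com - i_raw) / r_raw * 100.0

def index_altered(d, threshold=5.0):
    """An index counts as altered if (i) the IQR of D excludes zero
    difference, or (ii) |median D| exceeds `threshold` percent."""
    q1, med, q3 = np.percentile(d, [25, 50, 75])
    return (q1 > 0.0) or (q3 < 0.0) or (abs(med) > threshold)
```

A constant offset of one-tenth of the raw range, for instance, yields \(D = 10\%\) everywhere and is flagged as an alteration.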
Impact of Recording Schedule: Recording Length
Longer recordings may have reduced variance due to the smoothing of
transient audio anomalies (such as bird calls). We tested this by
comparing the variance of the recording groups at different recording
lengths. The index values are non-normally distributed, so we use
Levene’s test for homogeneity of variance (Fig. 1g).
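Levene's test compares spread around group centres rather than assuming normality, which suits these skewed index distributions. An illustrative Python analogue (the paper works in R; the groups here are simulated):

```python
import numpy as np
from scipy.stats import levene

rng = np.random.default_rng(0)

# Hypothetical index values at three recording lengths; longer
# recordings are simulated with a smaller spread, mimicking the
# smoothing of transient events such as bird calls.
short = rng.normal(0.5, 0.20, size=200)
medium = rng.normal(0.5, 0.10, size=200)
long_rec = rng.normal(0.5, 0.05, size=200)

stat, p = levene(short, medium, long_rec)
heterogeneous = p < 0.05   # variance differs between length groups
```

With such clearly different spreads, the test rejects homogeneity of variance.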
Impact of Parameter Alteration on Classification Task
We use random forest classification models to assess how well the
soundscapes are represented by each index type under each different
experimental parameter, using the randomForest (Liaw and Wiener,
2002) package in R (Fig. 1h). Models were trained on a middle 24 h
period of data from each site and tested on the remaining 46+ h of
audio. We used 2,000 decision trees to ensure accuracy had stabilised.
The model was trained and tested separately for every combination of
index type (Analytical Indices vs. AudioSet Fingerprint), compression
level and recording length. We determined accuracy, precision and recall
of each combination.
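A Python analogue of one such train/test combination (the study uses the randomForest package in R; scikit-learn and the toy data below are stand-ins):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

rng = np.random.default_rng(1)

# Toy stand-in data: rows = recordings, columns = index values
# (7 Analytical Indices here; 128 for the AudioSet Fingerprint),
# labels = site class (e.g. cleared / logged / primary).
X_train = rng.normal(size=(120, 7))
y_train = rng.integers(0, 3, size=120)
X_test = rng.normal(size=(60, 7))
y_test = rng.integers(0, 3, size=60)

# 2,000 trees, matching the paper's setting for stabilised accuracy.
clf = RandomForestClassifier(n_estimators=2000, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

acc = accuracy_score(y_test, pred)
prec = precision_score(y_test, pred, average="macro", zero_division=0)
rec = recall_score(y_test, pred, average="macro", zero_division=0)
```

In the study, one such model is fitted per combination of index type, compression level and recording length, with the three metrics recorded each time.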
Impact of Temporal Subsetting
Soundscapes typically show considerable diel variation in both abiotic
and biotic components. To assess the impact of this variance on model
performance, we split our recordings into four 6-hour sections centred
on Dawn (06:00), Noon (12:00), Dusk (18:00) and Midnight (00:00) and
then further subdivided these into 3-hour (8 sections) and 2-hour (12
sections) blocks (Fig. 1i). We trained and tested the random forest
model again on each set of temporally sectioned recordings, with each
section used to build models individually, and determined accuracy,
precision and recall as before.
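One plausible way to assign an hour of day to such centred blocks is sketched below; the exact boundaries of the 3-hour and 2-hour subdivisions are not stated in the text, so this scheme is an assumption:

```python
def time_block(hour, block_hours):
    """Assign an hour of day (0-23) to one of 24 // block_hours blocks.

    Hours are shifted by half a block so that, for block_hours == 6,
    the anchors 00:00, 06:00, 12:00 and 18:00 (Midnight, Dawn, Noon,
    Dusk) sit at the centre of their block.
    """
    shifted = (hour + block_hours // 2) % 24
    return shifted // block_hours
```

With 6-hour blocks, hours 03:00–08:59 all fall in the Dawn block centred on 06:00.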
Modelling the Impact of all Parameters on Accuracy Metrics
As the accuracy metrics are bound between 0 and 100%, we used a beta
regression to model the relationship between each of the experimental
parameters and performance metrics (Douma and Weedon, 2019). The model
was built using the betareg package in R (Cribari-Neto and
Zeileis, 2010). To avoid fitting issues when performance measures are
exactly 1, we rescale all performance measures using
\(m' = (m(n - 1) + 0.5)/n\), where \(n\) is the sample size
(Smithson & Verkuilen, 2006). The
model includes pairwise interactions between file size, temporal
subsetting, and recording length, and then all interactions of main
effects and those pairwise terms with the index selection. We observed
that the variance in performance measures varied as an interaction of
index choice and temporal subsetting (Supplementary 8a), so we tested
the inclusion of these terms in the precision component of the model.
We first treated file size and temporal subsetting as factors, but also
tested a model considering these as continuous variables. We found the
Akaike Information Criterion (AIC) was markedly lower in a beta
regression model using factors and including the precision component
(Supplementary 8b).
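The boundary rescaling step is simple enough to state as code; a Python sketch (the model itself is fitted with betareg in R, and the function name is our own):

```python
def squeeze_proportion(m, n):
    """Smithson & Verkuilen (2006) transformation m' = (m(n-1)+0.5)/n.

    Pulls a proportion m in [0, 1] strictly inside (0, 1) so that a
    beta regression can be fitted even when m is exactly 0 or 1.

    m : performance measure as a proportion (accuracy/precision/recall)
    n : sample size
    """
    return (m * (n - 1) + 0.5) / n
```

A perfect accuracy of 1.0 with n = 100 becomes 0.995, and 0.0 becomes 0.005, keeping every observation inside the open unit interval the beta distribution requires.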
Results
Although Spearman pairwise correlations of Analytical Indices and
maximum recordable frequency were low on average (mean = 0.32, IQR =
0.22), we found some strongly correlated sets of indices (Fig. 2). ADI,
Bio and NDSI all show strong similarities and are closely correlated
with maximum recordable frequency; AEve and H are also strongly
correlated (Fig. 2). Some features of the AudioSet Fingerprint correlate
with each other and maximum frequency but in general, these features are
more weakly correlated (mean = 0.14, IQR = 0.18; figure in
Supplementary 4b).
Impact of Compression
Impact of Compression: Like-for-Like Differences
All indices showed both observable differences under compression and
clear trends with increasing compression (confirmed with Spearman’s rank
correlation, all p < 0.001, Supplementary 5b). The mode of
response showed three broad qualitative patterns, illustrated here using
results from the 5-minute audio sample (other recording lengths in
Supplementary 5a). (1) Indices that were only affected above a
threshold level of compression (AudioSet Fingerprint: CBR16; M: CBR32;
and NDSI: CBR8). These indices typically showed low absolute D
(median D typically < 15%). (2) AEve and H showed the
biggest differences at an intermediate compression (CBR64) and
relatively low absolute differences (median D typically
< 30%). (3) The remaining indices showed a variety of
responses: ADI showed a monotonic response above a threshold, ACI
changed up to CBR64 and then stabilised, and Bio showed a stepped
pattern of increase. However, all three showed increasing and large
changes in absolute D (median D often > 75%) with
increasing compression.
Impact of Recording Schedule: Recording Length
Three out of seven (43%) of the Analytical Indices (ADI, AEve and H),
and a smaller proportion of the AudioSet Fingerprint values (46 out of
128; 36%) were found to have non-homogeneous variance in groups of
different recording length (p < 0.05, Levene’s test for
homogeneity of variance, Supplementary 6b).
Impact of Index Selection
Classifiers derived from 5-minute recordings using raw audio showed
higher accuracy for AudioSet Fingerprint (93.8%) than Analytical
Indices (80.9%, Table 2). This advantage held across all recording
lengths and performance metrics with performance gains of around 12-13%
in accuracy, precision and recall (Supplementary 7b).
Compression decreased accuracy for both AudioSet Fingerprint (CBR8:
90.8%) and Analytical Indices (CBR8: 75.1%, Table 2). Classifiers
trained on compressed AudioSet Fingerprint, however, still
outperformed those trained on uncompressed Analytical Indices.
For both index types, this reflected a decreased ability to differentiate
logged and primary forest. Interestingly, both index types showed better
discrimination between cleared land and logged forest under strong
compression. These patterns were repeated across recording lengths
(Supplementary 5a).