Introduction
Animal vocalisations combine with natural and human-made sounds to form soundscapes, which can be used to monitor species populations or to infer community-level metrics such as biodiversity (Roca and Proulx, 2016; Eldridge et al., 2018; Gómez, Isaza and Daza, 2018). Such monitoring is crucial for responding effectively to threats (Rapport, 1989; Rapport, Costanza and McMichael, 1998). The use of in situ expert listeners to monitor species presence and abundance was previously common (Huff et al., 2000), but it is costly and time-consuming, can damage habitats, and is prone to narrow focus and observer bias (Fitzpatrick et al., 2009; Costello et al., 2016). Advances in portable computing now permit remote recording of soundscapes, but the resulting volume of data precludes manual review, leading to the development of automated and semi-automated methods of analysis (Towsey, Truskinger and Roe, 2016; Sethi et al., 2020).
Soundscape composition is primarily assessed using acoustic indices – summary statistics that describe the distribution of acoustic energy within a recording (Towsey et al., 2014) – and over 60 Analytical Indices capturing aspects of biodiversity have been developed (Sueur et al., 2014; Buxton et al., 2018). These are commonly used in combination to compare the occupancy of acoustic niches, temporal variation, and the overall level of acoustic activity (Bradfer‐Lawrence et al., 2019) across ecological gradients or in classification tasks (Gómez, Isaza and Daza, 2018). These approaches have provided novel insight into ecosystems across the world (Fuller et al., 2015; Buxton et al., 2016; Eldridge et al., 2018; Sueur, Krause and Farina, 2019), but they are not infallible and often transfer poorly between settings (Mammides et al., 2017; Bohnenstiehl et al., 2018). This may result from a lack of standardisation: differing index selection, data storage methods, and recording protocols all lead to unassessed variation in experimental outputs (Araya-Salas, Smith-Vidaurre and Webster, 2019; Bradfer‐Lawrence et al., 2019; Sugai et al., 2019).
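As a concrete illustration of what an Analytical Index computes, the minimal sketch below calculates a normalised spectral entropy from a mono recording using only numpy and scipy. The file name is hypothetical, and published studies typically use dedicated packages (e.g. scikit-maad or seewave) whose exact index definitions may differ.

# Minimal sketch of one Analytical Index (normalised spectral entropy).
# Assumes a mono 16-bit WAV file; "soundscape.wav" is a hypothetical path.
import numpy as np
from scipy.io import wavfile
from scipy.signal import welch

rate, audio = wavfile.read("soundscape.wav")
audio = audio.astype(np.float64)
peak = np.max(np.abs(audio))
if peak > 0:
    audio /= peak                                   # normalise amplitude to [-1, 1]

# Estimate the power spectral density and treat it as a probability mass.
freqs, psd = welch(audio, fs=rate, nperseg=1024)
p = psd / psd.sum()

# Shannon entropy of the spectrum, scaled to [0, 1] (1 = energy spread evenly across frequencies).
spectral_entropy = -np.sum(p * np.log2(p + 1e-12)) / np.log2(len(p))
print(f"Spectral entropy: {spectral_entropy:.3f}")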
The AudioSet convolutional neural net (CNN; Gemmeke et al., 2017; Hershey et al., 2017) is an attractive replacement for Analytical Indices. This pre-trained, general-purpose audio classifier generates a multi-dimensional acoustic fingerprint of a soundscape that is a more effective ecological descriptor (Sethi et al., 2020). The CNN is trained on AudioSet, a collection of two million human-labelled anthropogenic and environmental audio samples, potentially giving it both greater transferability and greater discrimination than typical ecoacoustic training datasets.
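For illustration only, the sketch below extracts such a CNN fingerprint with the publicly released VGGish model trained on AudioSet, via TensorFlow Hub. This is not the authors' exact pipeline; the model URL, resampling step and mean-pooling are assumptions made for the example.

# Minimal sketch: CNN acoustic fingerprint from the AudioSet-trained VGGish model.
# Not the authors' exact pipeline; TF Hub URL and pooling choice are assumptions.
import numpy as np
import tensorflow_hub as hub
from scipy.io import wavfile
from scipy.signal import resample_poly

vggish = hub.load("https://tfhub.dev/google/vggish/1")    # returns 128-d frame embeddings

rate, audio = wavfile.read("soundscape.wav")               # hypothetical mono 16-bit file
audio = audio.astype(np.float32) / 32768.0                 # 16-bit PCM -> [-1, 1]
audio = resample_poly(audio, 16000, rate)                  # VGGish expects 16 kHz input

embeddings = vggish(audio)                                 # one 128-d vector per ~0.96 s frame
fingerprint = np.mean(embeddings.numpy(), axis=0)          # pool frames to a single descriptor
print(fingerprint.shape)                                   # (128,)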
In ecoacoustics, continuous uncompressed or lossless recording is generally recommended (Villanueva-Rivera et al., 2011; Browning et al., 2017), but generates very large files. We consider two commonly used approaches to reducing storage requirements (Towsey, 2018). The first is MP3 compression, which is widely used in ecoacoustic studies (e.g. Saito et al., 2015; Zhang et al., 2016; Sethi et al., 2018): this lossy encoding removes acoustic information inaudible to human listeners but is suspected of also removing ecologically important data (e.g. Towsey, Truskinger and Roe, 2016; Sugai et al., 2019). Araya-Salas, Smith-Vidaurre and Webster (2019) recently showed that ecological information is lost under high compression of recordings of isolated animal calls; however, it is not known whether this extends to recordings of noisier whole soundscapes.
The second is the recording schedule, which also varies among ecoacoustic studies (Sugai et al., 2019). Bradfer‐Lawrence et al. (2019) showed that longer and more continuous schedules give more stable Analytical Index values. However, soundscape composition varies with time of day (Fuller et al., 2015; Bradfer‐Lawrence et al., 2019; Sethi et al., 2020), so splitting recording effort into separate windows may reduce temporal variation and improve classification (Sugai et al., 2019) even with less data. Similarly, calculating indices over longer recordings may average away anomalous calls and short-term patterns.
While clear standards are crucial for collaborative research in ecoacoustics, there is uncertainty in the literature about the impacts of index choice, compression level and recording schedule. Here, we:
contrast the classification accuracy achieved with different choices of index;
describe the effects of compression, recording length and temporal subsetting on the values, variance and classification performance of indices.
By describing how well ecological information is retained in acoustic data under different recording decisions, we identify stronger standards that both improve performance and provide a basis for more extensive meta-analysis.
Methods and Materials
Study Area
Acoustic samples were collected in Sabah, Malaysia, at the Stability of Altered Forest Ecosystems (SAFE) project: a large-scale ecological experiment on the effects of habitat loss and fragmentation on tropical forests (Ewers et al., 2011), with sites in the Kalabakan Forest Reserve (KFR). Historically, logging within KFR has been heterogeneous, reflecting habitat modification in the wider area (Struebig et al., 2013), with higher-than-typical timber extraction rates. Habitat ranges from areas of grass and low shrub, through logged forest, to almost undisturbed primary forest.
Soundscape Recording
Data were collected from three KFR sites representing a gradient in above-ground biomass (AGB; Pfeifer et al., 2016) (Fig. 1a): primary forest (AGB = 66.16 t ha⁻¹), logged forest (AGB = 30.74 t ha⁻¹), and cleared forest (AGB = 17.37 t ha⁻¹) (Supplementary 1). We recorded for an average of 72 hours at each site (range: 70 to 75) during February and March 2019 (Supplementary 2a). No rain fell during the recording period, so no recordings had to be excluded due to confounding geophony (Zhang et al., 2016). At all sites, omnidirectional recorders (AudioMoth; Hill et al., 2018) were attached to trees (~50 cm diameter, 1-2 m above the ground) and recorded continuously in 20-minute uncompressed samples ('raw', .wav format) at 44.1 kHz and 16-bit.
Compressing and Re-Sizing the Raw Audio
Continuous 20-minute recordings were first split into recordings of 2.5, 5.0 and 10.0 minutes using the Python package pydub (Webbie et al., 2018) (Fig. 1b). The audio was then converted to lossy MP3 format with the fre:ac LAME encoder, using the two standard LAME MP3 encoding techniques: constant bit rate (CBR) and variable bit rate (VBR) compression (Fig. 1c). CBR reduces the file size to a specified number of kilobits per second; VBR varies the bitrate from second to second depending on analysis of the acoustic content and a quality setting (0 = highest quality, larger bitrate; 9 = lowest quality, smaller bitrate). Since bitrates are not directly comparable between VBR and CBR – and because storage savings are often the principal driver of compression choices – we use compressed file size as our measure of compression level. We used VBR0 and CBR320, CBR256, CBR128, CBR64, CBR32, CBR16 and CBR8, resulting in file sizes ranging from 41.6% (CBR320) to 1.04% (CBR8) of the original raw file size, with some reductions in maximum coded frequency (Table 1). We do not consider lossless compression, as the resulting files remain much larger and are, by definition, identical to the raw audio after decompression. Previous studies have also found that losslessly compressed audio is largely identical to raw audio (Linke and Deretic, 2020).
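The sketch below illustrates this splitting and compression workflow in Python with pydub, which wraps ffmpeg. The study itself used the fre:ac LAME encoder, so the file names and encoder flags shown here are illustrative assumptions rather than the exact commands used.

# Illustrative sketch of splitting a 20-minute recording and exporting MP3 copies
# with pydub/ffmpeg; the study used the fre:ac LAME encoder, so file names and
# encoder flags here are assumptions for the example.
from pydub import AudioSegment

recording = AudioSegment.from_wav("site_primary_20min.wav")   # hypothetical raw file

# Split the continuous recording into 5-minute chunks (pydub slices in milliseconds).
chunk_ms = 5 * 60 * 1000
chunks = [recording[i:i + chunk_ms] for i in range(0, len(recording), chunk_ms)]

for n, chunk in enumerate(chunks):
    # Constant bit rate: fix the output at, e.g., 128 kbit/s (CBR128).
    chunk.export(f"chunk_{n}_cbr128.mp3", format="mp3", bitrate="128k")
    # Variable bit rate: pass ffmpeg's LAME quality flag (0 = highest quality, VBR0).
    chunk.export(f"chunk_{n}_vbr0.mp3", format="mp3", parameters=["-q:a", "0"])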