Methods

Camera Trap Study

The subset of images used to train the model was pulled from a camera trap study consisting of 170 cameras, which were deployed for up to three years across two regions of South Carolina (see Supplementary material Appendix 1 for camera trap study details). We acquired images for the train and test datasets from 50 camera locations from each region within two separate one-month time frames. The complete dataset consisted of 5,277 images of 17 classes, including images from both winter and summer months to account for seasonal background variation (Table 1). True negative images were not included because they would not assist in teaching the model about any of the species classes. A commonly used 90/10 split (e.g. Fink et al. 2019) was used to create the training and testing datasets from the selected images: 90% of images were used for training and 10% were used for testing.
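As an illustration of this splitting step, a minimal Python sketch of a 90/10 split is shown below; the shuffling seed and the list of labeled image paths are placeholders rather than values from the study.

```python
# Minimal sketch of a 90/10 train/test split over labeled images.
# The seed and the input list are illustrative placeholders.
import random

def split_images(image_paths, train_fraction=0.9, seed=42):
    """Shuffle labeled image paths and split them into train and test sets."""
    rng = random.Random(seed)
    shuffled = list(image_paths)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Example usage with a hypothetical list of labeled images:
# train_images, test_images = split_images(all_labeled_images)
```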

The basic process (Fig. 1) included selecting and labeling a subset of images from our camera trap image repository (see Supplementary material Appendix 1 for details) for transfer training, in order to adapt a pre-made neural network to our image set. To begin, a subset of images was created by selecting 500 images of each species in a variety of positions within the field of view (Fig. 1, Step 1). Once a class (species being classified) reached 500 images, only images that contributed a unique perspective of the animal were added to the training dataset, in order to supply the model with a better generalization of the animal. The number of images per class was limited to ensure the model did not favor one class simply because it was represented by more images. Although more than 500 images were added to some classes, class weights were not strongly affected and remained comparable.
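A simple way to confirm that class weights remained comparable is to count the curated images per class; the sketch below assumes the training records are stored as (image path, class label) pairs, which is an illustrative format rather than the study's exact data structure.

```python
# Illustrative check of class balance in the curated training set.
# `records` is assumed to be an iterable of (image_path, class_label) pairs.
from collections import Counter

def class_counts(records):
    """Return the number of training images per class."""
    return Counter(label for _, label in records)

# Example: class_counts(training_records).most_common()
```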

Feature Extraction

The use of a supervised training process, built on human-generated bounding boxes, increases the accuracy of detection and classification (Supplementary material Appendix 2). LabelImg (Tzutalin 2015), a graphical image annotation tool, was used to establish ground truths (the locations of all objects in an image) and to create the records needed for our supervised training process. This software allows a user to draw a box containing the object and automatically generates a CSV file with the coordinates of the bounding box as well as the class defined by the user.
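For clarity, a short sketch of reading bounding-box records of the kind described is given below; the column names (filename, class, xmin, ymin, xmax, ymax) are an assumed layout, not necessarily the exact schema produced in this study.

```python
# Sketch of loading bounding-box annotations from a CSV of labeled images.
# Column names are assumed for illustration.
import csv

def load_annotations(csv_path):
    """Return a list of dicts, one per labeled bounding box."""
    boxes = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            boxes.append({
                "filename": row["filename"],
                "label": row["class"],
                "xmin": int(row["xmin"]),
                "ymin": int(row["ymin"]),
                "xmax": int(row["xmax"]),
                "ymax": int(row["ymax"]),
            })
    return boxes
```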

Classification Training

A transfer training process to adapt a pre-made neural network (Fig. 1, Step 3) was employed to create an identification and classification model. We transformed the CSV file generated by the feature extraction process into a compatible tensor dataset for the training process, following the methodologies laid out in the Tensorflow (Abadi et al. 2015) package description. Tensorflow is an open-source Python library from Google for building identification and classification models. The Tensorflow transfer training process required a clone of the Tensorflow repository, in combination with a customized model configuration file defining training parameters (Table 2).
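A condensed sketch of this conversion, following the general pattern in the Tensorflow documentation for serializing examples, is given below; the feature keys and the normalization of box coordinates are assumptions about the pipeline, not the study's exact code.

```python
# Sketch of converting one annotation row into a serialized TensorFlow Example.
# Feature keys and coordinate normalization are illustrative choices.
import tensorflow as tf

def _bytes(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _floats(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

def _ints(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def make_example(image_bytes, width, height, box, class_id):
    """box is (xmin, ymin, xmax, ymax) in pixels; coordinates are normalized to [0, 1]."""
    xmin, ymin, xmax, ymax = box
    feature = {
        "image/encoded": _bytes(image_bytes),
        "image/object/bbox/xmin": _floats([xmin / width]),
        "image/object/bbox/ymin": _floats([ymin / height]),
        "image/object/bbox/xmax": _floats([xmax / width]),
        "image/object/bbox/ymax": _floats([ymax / height]),
        "image/object/class/label": _ints([class_id]),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Example usage:
# with tf.io.TFRecordWriter("train.record") as writer:
#     writer.write(make_example(img_bytes, w, h, (10, 20, 200, 180), 3).SerializeToString())
```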

Training Evaluation

The degree of learning completed after each training step was analyzed using intersection over union (IOU) as training occurred (Krasin et al. 2017). A greater IOU equates to a higher overlap between generated predictions and human-labeled regions, thus indicating a better model (see Supplementary material Appendix 3). Observing an asymptote in IOU allowed us to determine the minimum number of steps needed to train the model for each class and to assess which factors influenced the training process (e.g. feature qualities, amount of training images). Because image quantity was not associated with the minimum number of steps required, we relied on quality assessments, such as animal size and animal behavior.
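For reference, IOU for two axis-aligned boxes can be computed as in the sketch below; boxes are assumed to be (xmin, ymin, xmax, ymax) tuples, a convention chosen here for illustration rather than one stated in the study.

```python
# Minimal intersection-over-union (IOU) for two axis-aligned bounding boxes.
def iou(box_a, box_b):
    """Return the overlap of box_a and box_b as a fraction of their union area."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```
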
Following training, final discrepancies between the model output and the labeled ground truths were summarized into confusion matrices (generated by scikit-learn, Table 3) including false positives (FPs), false negatives (FNs), true positives (TPs), true negatives (TNs), and misidentifications. Several metrics were calculated to evaluate aspects of model performance (Fig. 2). Relying on accuracy alone may result in exaggerated confidence in the model's performance, so to avoid this bias, the model's precision, recall, and F-1 score were also calculated. Precision is a measure of FPs while recall is a measure of FNs, with F-1 being the harmonic mean of the two (Fig. 2). Due to the large proportion of TNs associated with camera trap studies, the F-1 score does not include TNs, in order to focus on measuring the detection of TPs.
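The scikit-learn evaluation step can be summarized by the sketch below; the class labels and the macro averaging choice are placeholders standing in for the model's predictions and the labeled ground truths.

```python
# Illustrative evaluation with scikit-learn; labels and averaging are placeholders.
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = ["deer", "raccoon", "turkey", "deer"]   # labeled ground truths (illustrative)
y_pred = ["deer", "deer", "turkey", "deer"]      # model output (illustrative)

labels = ["deer", "raccoon", "turkey"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
precision = precision_score(y_true, y_pred, labels=labels, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, labels=labels, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, labels=labels, average="macro", zero_division=0)
```
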
In addition, the metrics were further separated into evaluations for identification and classification purposes. Identification (ID) evaluations focus only on finding objects and therefore count misidentifications as correct because the object was found; classification (CL) evaluations do not count misidentifications as correct. Finally, accuracy, precision, recall, and F-1 were calculated at a variety of confidence thresholds (CT), a parameter constraining the lower limit of confidence necessary for a classification proposal, to determine the threshold that resulted in the highest value of the metric we wished to optimize.
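The threshold search can be illustrated as below; evaluate_f1_at is a hypothetical helper that would re-score the test set at a given confidence threshold, and the candidate thresholds are arbitrary.

```python
# Illustrative sweep over confidence thresholds (CT) to maximize F-1.
# `evaluate_f1_at` is a hypothetical helper that scores the test set at a given CT.
def best_confidence_threshold(evaluate_f1_at,
                              thresholds=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Return the (threshold, F-1) pair with the highest F-1 score."""
    scores = [(ct, evaluate_f1_at(ct)) for ct in thresholds]
    return max(scores, key=lambda pair: pair[1])

# Example: optimal_ct, optimal_f1 = best_confidence_threshold(evaluate_f1_at)
```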

Validation

To confirm the results acquired from testing the model, it was essential to evaluate a separate validation set of images. This validation set was formed by randomly selecting five cameras from a 12-week period separate from the training dataset, but within the same larger dataset. The validation subset consisted of 10,983 images, including true negatives. The set was run using the optimal CT for F-1 score determined from the test data. These images were also labeled using LabelImg to automate the calculation of evaluation metrics. The validation-set scores and test scores should be compared to determine whether the model is overfitted, meaning the test set is not representative of the validation set. Possible reasons for such a mismatch may be that the background environment has changed dramatically or that species not included in the test set have appeared.