Appendix 2: Bounding Boxes

We used bounding boxes to establish ground truth in our study, increasing the information each image contributes and allowing us to train our model with far fewer images. Bounding boxes give the model the location of each object, delineating the object from background noise (SI Fig. 3, human labeled). Without bounding boxes, the model has greater difficulty recognizing common patterns among similar objects, and identification becomes harder still when repeated, uncorrelated, confounding objects or background noise are present.
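The annotation format is not specified here; purely as an illustration, the sketch below assumes a YOLO-style convention, in which each labeled object is stored as a class index plus a box normalized to the image dimensions. The function name, class indices, and coordinates are hypothetical.

```python
def to_yolo_label(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space bounding box to a YOLO-style label line:
    class_id x_center y_center width height, all normalized to [0, 1].
    (Assumed format for illustration; the study's format may differ.)"""
    x_center = (x_min + x_max) / 2.0 / img_w
    y_center = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Hypothetical example: a turkey (class 0) boxed at (120, 200)-(480, 560)
# in a 1920x1080 camera trap frame.
print(to_yolo_label(0, 120, 200, 480, 560, 1920, 1080))
```

One label file per image, with one line per object, lets the training pipeline pair every image with the exact extent of each animal it contains.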
Once trained, the model identifies and classifies objects by placing bounding boxes: a box, the corresponding label for that object, and a feature score. The feature score is the percent likelihood that the detected object reflects its label. Our model correctly identified the objects in images 1–3. The model can be more precise than human labelers in finding objects; for example, image 6 displays correctly labeled tail feathers of a turkey that were not labeled correctly by the human labelers. The model may also detect objects incorrectly (image 5), typically with low confidence.

The confidence threshold (CT) was set at 50%, so any object detected with over 50% confidence was displayed. The CT can be raised to suppress low-confidence detections, but keeping them visible during training gives insight into errors that may affect validation accuracy and F-1 score. For example, to address the misidentification in image 5, training images with the same background, or additional images of grey squirrels, can be added to help the model distinguish the misidentified object. Image 4 shows an example of object splitting, in which one object is identified by two bounding boxes. Object splitting causes miscounts of the number of individuals in an image. Again, adding similar training images of events where object splitting occurs increases the chance of correct bounding boxes. These discrepancies suggest that a completely thorough analysis of camera trap imagery requires a combination of AI prescreening and human labelers.
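As a sketch only (the Detection structure, threshold values, and merging heuristic below are assumptions, not the study's pipeline), confidence thresholding and a simple overlap check against object splitting might look like the following. The merge step resembles non-maximum suppression: two heavily overlapping boxes with the same label are counted as one individual.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str    # predicted class, e.g. "turkey" (hypothetical)
    score: float  # feature (confidence) score in [0, 1]
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels

def iou(a, b):
    """Intersection-over-union of two boxes; a high IoU between
    same-label detections is one signal of object splitting."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def filter_and_count(detections, ct=0.50, split_iou=0.30):
    """Keep detections above the confidence threshold (CT), then
    merge overlapping same-label boxes that likely represent a
    single split object before counting individuals."""
    kept = sorted((d for d in detections if d.score > ct),
                  key=lambda d: d.score, reverse=True)
    counted = []
    for d in kept:
        # Skip boxes that heavily overlap an already-counted
        # detection of the same label (probable object splitting).
        if any(c.label == d.label and iou(c.box, d.box) > split_iou
               for c in counted):
            continue
        counted.append(d)
    return counted

# Hypothetical detections from one frame: one turkey split into two
# boxes, plus a low-confidence grey squirrel misdetection.
dets = [
    Detection("turkey", 0.92, (100, 100, 300, 400)),
    Detection("turkey", 0.61, (110, 240, 310, 410)),
    Detection("grey squirrel", 0.53, (500, 50, 560, 110)),
]
for d in filter_and_count(dets):
    print(d.label, d.score)  # one turkey, one squirrel survive
```

Raising `ct` suppresses errors like the squirrel misdetection at the cost of hiding the low-confidence mistakes that are informative during training, while tuning `split_iou` trades missed merges against accidentally collapsing two genuinely adjacent animals into one count.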