We first built a 2D convolutional neural network to detect the location of each joint (Fig. 2). Specifically, for each joint to be located, the entire image served as the input and a 30-by-30-pixel square mask served as the ground truth label representing the joint's location. The network has an encoder that extracts feature maps at multiple resolutions and scales through convolutional layers, and a decoder that decodes the abstract information in these feature maps through up-convolutional layers. Concatenation layers link the encoder and decoder to prevent information loss.
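
As an illustration of the labeling scheme, the following minimal sketch builds a binary mask of the same size as the input image, with a 30-by-30-pixel square marking one joint. The function name `make_joint_mask`, the centering of the square on the joint coordinate, the 0/1 encoding, and the clipping at image borders are assumptions made for this example; the text above does not specify them.

```python
import numpy as np


def make_joint_mask(height, width, center_row, center_col, size=30):
    """Return a (height, width) binary mask with a size-by-size square of ones.

    Assumption: the square is centered on the joint coordinate and clipped
    to the image bounds. These details are illustrative, not taken from
    the original method description.
    """
    half = size // 2
    # Square spans [center - half, center - half + size), clipped to the image.
    top = max(center_row - half, 0)
    bottom = min(center_row - half + size, height)
    left = max(center_col - half, 0)
    right = min(center_col - half + size, width)

    mask = np.zeros((height, width), dtype=np.uint8)
    mask[top:bottom, left:right] = 1
    return mask


# Hypothetical example: a 100-by-120 image with a joint at row 50, column 60.
mask = make_joint_mask(100, 120, 50, 60)
```

Under these assumptions, one such mask would be produced per joint and paired with the full image as a training example for the segmentation-style network.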