Results

Overview of experimental design

Scoring joint damage in RA is a complex task comprising two subtasks: (1) object detection, in which multiple joints must be detected and located within an image, and (2) disease recognition, in which the degree of damage is predicted through regression analysis. To solve this problem, we developed a multi-step pipeline (Fig. 1). We first built a deep convolutional neural network model to identify the location of each joint. Once the locations were obtained, each image was cut into small patches centered around the joints. The image patches were then used as the input for a specially designed neural network that performs regression analysis of joint damage scores as well as semantic segmentation of joint space regions. Notably, segmentation was not required for scoring, yet including it significantly improved predictive performance. Finally, because patients with RA tend to develop symmetrical symptoms on both sides of the hands and feet, and joint space narrowing and bone erosion often occur together, we developed a tree-based conventional machine learning model to integrate all available information from both sides and both types of damage. This step not only further improved performance but also revealed cross-joint prediction relationships that reflect the nature of RA.
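The flow of the pipeline can be summarized in pseudocode. This is a schematic sketch only: the helper functions (locate_joint, predict_damage, refine_scores) and the patch size are hypothetical stand-ins for the trained models described in the following sections, not the authors' actual code.

```python
import numpy as np

# Hypothetical stand-ins for the trained models (illustrative names only;
# in the real pipeline each is a fitted neural network or tree-based model).
def locate_joint(image, joint_name):
    return 64, 64  # step 1: CNN predicts the joint's (x, y) center

def predict_damage(patch):
    return float(patch.mean())  # step 3: patch-level scoring network

def refine_scores(raw_scores):
    return raw_scores  # step 4: tree-based integration across joints

def score_patient(images, joints, half=64):
    """images: dict like {'left_hand': 2D array}; joints: joint names per side."""
    raw = {}
    for side, image in images.items():
        for joint in joints[side]:
            x, y = locate_joint(image, joint)                    # step 1
            patch = image[y - half:y + half, x - half:x + half]  # step 2: crop
            raw[(side, joint)] = predict_damage(patch)           # step 3
    return refine_scores(raw)                                    # step 4
```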

Convolutional neural network locates joint positions with high accuracy

We first built a 2D convolutional neural network model to detect the location of a joint (Supplementary Fig. 1). Specifically, for each joint to be located, we used the entire image as the input and a 30-by-30 pixel square mask as the ground truth label representing the joint's location. The network has an encoder that extracts feature maps at multiple resolutions and scales through convolutional layers, and a decoder that decodes the abstract information in these feature maps through up-convolutional layers. Concatenation layers link the encoder and decoder to prevent information decay.
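A minimal sketch of such an encoder-decoder locator is shown below, assuming PyTorch. The depth and layer sizes are illustrative, not the exact architecture of Supplementary Fig. 1; only the overall structure (convolutional encoder, up-convolutional decoder, concatenated skip connection, mask output) follows the text.

```python
import torch
import torch.nn as nn

class JointLocator(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: convolution + downsampling extracts multi-scale feature maps.
        self.enc1 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.MaxPool2d(2),
                                  nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        # Decoder: up-convolution restores the spatial resolution.
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        # One output channel: predicted mask for the 30-by-30 joint square.
        self.head = nn.Conv2d(16, 1, 1)

    def forward(self, x):
        f1 = self.enc1(x)              # full resolution
        f2 = self.enc2(f1)             # half resolution
        u = self.up(f2)                # back to full resolution
        u = torch.cat([u, f1], dim=1)  # skip connection prevents information decay
        return torch.sigmoid(self.head(self.dec(u)))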
To evaluate the predictive performance of joint localization, we performed 10-fold cross-validation experiments for each joint in the fingers, wrists (Fig. 2a), and toes (Fig. 2b). Since image sizes vary, we used a normalized distance to measure the difference between predictions and ground truth labels (see details in Methods). Briefly, the coordinates of each point within an image are rescaled from pixel values to continuous values between 0 and 1 by dividing by the height or width of the image, so that the results are uniform and comparable across images of different sizes. The distributions of normalized distances are shown as boxplots in Fig. 2c-e. We selected a normalized distance of 0.02 as the cutoff (horizontal red dashed lines) to measure predictive accuracy. Examples of normalized distances of 0.01 and 0.02 are shown in Fig. 2f-g. In general, joints in the fingers are easier to locate, and more than 98.0% of test joints fall within the 0.02 normalized distance (Fig. 2c). In contrast, joints in the wrists are harder to distinguish owing to their proximity (Fig. 2d). The accuracy of locating joints in the toes is similar to that in the fingers (Fig. 2e). Overall, the convolutional neural network model locates joints with high accuracy.
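The normalized distance reduces to a short computation; the sketch below follows the description above, with illustrative variable names and example coordinates.

```python
import numpy as np

def normalized_distance(pred_xy, true_xy, width, height):
    """Euclidean distance after rescaling pixel coordinates to [0, 1]."""
    px, py = pred_xy[0] / width, pred_xy[1] / height
    tx, ty = true_xy[0] / width, true_xy[1] / height
    return np.hypot(px - tx, py - ty)

# Example: an ~11-pixel error on a 1500 x 2000 image gives a normalized
# distance of ~0.007, well within the 0.02 accuracy cutoff used in Fig. 2c-e.
d = normalized_distance((760, 505), (750, 500), width=1500, height=2000)
print(d, d <= 0.02)  # ~0.0071, True
```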

Segmentation of joint space regions significantly improves joint damage prediction

Based on the joint locations obtained from the previous step, we cut full images into patches centered around the joints to be scored. These patches were then used as the input for a deep learning model to predict the damage score of each joint. We designed a novel neural network architecture for patch-based damage prediction that simultaneously outputs the damage score and a segmentation of the joint space region (Supplementary Fig. 2). Specifically, the architecture contains two parts. The first part includes an encoder and a decoder, which together extract features at multiple scales and resolutions; its output is the segmentation mask, which is further used as the input for the second part of the network. The rationale is that, guided by the segmentation mask, the network more easily learns where to “look” and focuses primarily on the region of interest to determine the damage level. The second part contains a regressor that generates a single output value representing the damage score.
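The two-part design can be sketched as follows, again assuming PyTorch. The layer sizes are illustrative placeholders for the architecture in Supplementary Fig. 2; the point is the structure: part 1 predicts the joint space mask, and part 2 regresses the score from the patch concatenated with that mask, so the regressor is steered toward the region of interest.

```python
import torch
import torch.nn as nn

class DamageScorer(nn.Module):
    def __init__(self):
        super().__init__()
        # Part 1: segmentation of the joint space region (stand-in for the
        # full encoder-decoder described in the text).
        self.segmenter = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1), nn.Sigmoid(),
        )
        # Part 2: regressor mapping (patch + mask) to a single damage score.
        self.regressor = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, 1),
        )

    def forward(self, patch):
        mask = self.segmenter(patch)  # tells the regressor where to "look"
        score = self.regressor(torch.cat([patch, mask], dim=1))
        return score, mask            # trained jointly on both outputs
```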
To comprehensively evaluate the predictive performance and investigate the effect of this architecture, we performed 10-fold cross-validation experiments on hands and feet individually. The primary evaluation metric was Pearson's correlation between predictions and ground truth labels created by human professionals. We also considered a secondary metric, the root mean square error (RMSE) between predictions and ground truth. We benchmarked the neural network models with and without segmentation of the joint space region. In both hands and feet, the network with segmentation achieved significantly higher correlations and lower RMSEs than the network without segmentation (Fig. 3a-b).
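For reference, both metrics are standard and can be computed directly from arrays of predicted and ground-truth scores pooled across cross-validation folds; this minimal sketch is not the authors' evaluation code.

```python
import numpy as np
from scipy.stats import pearsonr

def evaluate(pred, truth):
    """pred, truth: 1D NumPy arrays of predicted and human-assigned scores."""
    r, _ = pearsonr(pred, truth)                  # primary: Pearson's correlation
    rmse = np.sqrt(np.mean((pred - truth) ** 2))  # secondary: root mean square error
    return r, rmse
```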
Since joint space narrowing and erosion often occur simultaneously, a joint with narrowing damage is likely to also have bone erosion. We therefore hypothesized that the architecture with joint space segmentation would also improve the prediction accuracy of erosion damage. Indeed, as with joint space narrowing, this model significantly increased the correlations and decreased the RMSEs for erosion prediction in both hands and feet (Fig. 3c-d).
Intuitively, incorporating segmentation of bone erosion regions should improve performance in the same way as segmentation of joint space regions. Unfortunately, erosion segmentation did not improve performance (Supplementary Fig. 3). The main reason is that segmenting erosion regions is much harder than segmenting joint space regions, so the quality of the segmentation is relatively low. We anticipate that this network will further improve once high-quality erosion masks become available in future studies.