Another crucial question in machine learning studies is how closely the performance of a model approaches that of a human. In practice, human scorers are not perfect, especially for the tedious and time-consuming SvH scoring. The clinical upper limit of this RA scoring problem can be estimated by measuring the agreement between two human scorers. We therefore calculated Pearson’s correlation between scores from two trained human professionals and compared the performance of our method with this clinical upper limit (Fig. 10). For joint space narrowing, our method achieved performance comparable to that of human scorers. For joint erosion, there is still room to improve machine learning methods. Overall, these results indicate that our method is closing the gap between computers and humans for the joint damage scoring problem, especially for joint space narrowing.
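As an illustration of how such an upper limit can be estimated, the following minimal sketch computes Pearson's correlation between two human scorers and between a model and one scorer. The function name `pearson_r` and all score values are hypothetical toy inputs chosen for illustration; they are not data or code from this study.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length score vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape or x.ndim != 1:
        raise ValueError("x and y must be 1-D arrays of the same length")
    # Center both vectors, then divide their covariance term by the
    # product of their (unnormalized) standard deviations.
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    if denom == 0:
        raise ValueError("correlation is undefined for constant scores")
    return float((xc * yc).sum() / denom)

# Hypothetical toy scores for six joints (assumed values, not study data).
scorer_a = [0, 1, 2, 3, 4, 2]
scorer_b = [0, 1, 3, 3, 4, 1]
model = [0.2, 0.9, 2.4, 2.8, 3.7, 1.6]

upper_limit = pearson_r(scorer_a, scorer_b)  # human vs. human agreement
model_r = pearson_r(model, scorer_a)         # model vs. one human scorer
gap = upper_limit - model_r                  # distance to the upper limit
```

In this sketch, the human-to-human correlation plays the role of the upper limit, and `gap` is one simple way to express how far a model remains from it.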

Discussion

In this study, we developed a novel machine learning approach for integrating multiple images and automatically quantifying joint space narrowing and bone erosion in rheumatoid arthritis. We designed a special neural network architecture that simultaneously scores joint damage levels and segments the joint space regions. This design not only significantly improves the prediction accuracy but also highlights the regions of interest to assist further analysis in clinical settings. The idea of introducing segmentation into an image-based regression deep learning model need not be limited to joint damage scoring in RA. In fact, many biomedical imaging problems share a similar situation: the quantification or diagnosis depends closely on specific parts of an image. Segmenting the disease-related regions of interest will be crucial to guide a neural network to focus on those regions, improve the performance, and facilitate subsequent error analysis or clinical diagnosis. This is especially true when the sample size is relatively small, where the segmentation serves as a “teacher” that helps the model train well with a limited number of samples.
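One common way to train a network that both regresses a score and segments a region is to sum a regression loss and a segmentation loss. The sketch below shows this idea with a mean squared error term and a soft Dice term. The choice of these two losses, the function names, and the `seg_weight` parameter are assumptions made for illustration; they are not taken from the method described here.

```python
import numpy as np

def soft_dice_loss(pred_mask, true_mask, eps=1e-6):
    """1 - soft Dice overlap between a predicted and a reference mask."""
    pred = np.asarray(pred_mask, dtype=float)
    true = np.asarray(true_mask, dtype=float)
    intersection = (pred * true).sum()
    dice = (2.0 * intersection + eps) / (pred.sum() + true.sum() + eps)
    return float(1.0 - dice)

def joint_loss(pred_score, true_score, pred_mask, true_mask, seg_weight=1.0):
    """Weighted sum of a score-regression loss and a segmentation loss.

    seg_weight is a hypothetical hyperparameter that balances the two tasks.
    """
    pred_score = np.asarray(pred_score, dtype=float)
    true_score = np.asarray(true_score, dtype=float)
    regression = float(np.mean((pred_score - true_score) ** 2))  # MSE
    segmentation = soft_dice_loss(pred_mask, true_mask)
    return regression + seg_weight * segmentation

# Toy example (assumed values): a 2x2 reference mask and a damage score of 2.
true_mask = np.array([[0, 1], [1, 0]])
perfect = joint_loss([2.0], [2.0], true_mask, true_mask)
wrong = joint_loss([3.0], [2.0], 1 - true_mask, true_mask)
```

Because the segmentation term is computed per pixel, it supplies a much denser training signal than a single score per joint, which is one way to read the “teacher” role of segmentation when samples are limited.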
Inspired by the widely observed symmetrical symptoms in patients with RA, the method we developed learns multilevel symmetry and dependence across images. This approach is novel in that it seamlessly integrates multiple layers of information from different images to guide prediction, and it can be extended to other medical imaging fields. Additionally, we investigated the relationships among joints and damage types in our machine learning model and revealed a disease-specific map; this data-driven RA-specific map is instructive for clinical decisions. This study design can be applied to many biomedical imaging problems and biological studies, with or without symmetrical patterns. By analyzing the contributions of the multiple images used in a machine learning model, the hidden relationships between different disease manifestations can be revealed from a new computational perspective, complementing direct experimental observations and current knowledge.
Although many deep learning models have been developed for image-based joint damage detection, there is still room for improvement. Top-performing methods in the DREAM Rheumatoid Arthritis Challenge, including ours, consist of multiple steps [41]. Multi-step methods require more manual design of multiple modules, whereas end-to-end methods have simpler workflows and are easier to deploy in clinical settings without external priors and constraints [42]. Ideally, end-to-end deep learning algorithms should be designed to simultaneously output damage scores for both joint space narrowing and bone erosion. Yet the performance of end-to-end approaches without hand-engineered components will be largely limited by the much smaller number of available images, compared to the millions of images in ImageNet [43]. Moreover, unlike simple object detection tasks in computer vision, detecting multiple joints and two types of damage within an image considerably complicates the problem. In practice, a fully automatic deep learning system for biomedical image analysis can significantly benefit many clinical disciplines in terms of efficiency and cost-effectiveness [44]. However, complete automation remains a translational gap, and human-in-the-loop computing can be beneficial for many biomedical image problems [45].
In terms of detecting bone erosion in RA, we observe a gap between our method and the clinical upper limit. The major limitation is the relatively low quality of erosion scores. In fact, Pearson's correlation between erosion damage scores by two trained human professionals (~0.78) is substantially lower than that between joint space narrowing scores (~0.88). This indicates that scoring bone erosion damage is inherently more difficult, which leads to more inconsistency between human experts. In our computational pipeline, a key component for improving performance is the segmentation of damaged joint regions. In contrast to the straightforward segmentation of joint space regions, the delineation of bone erosion regions may vary between humans. If high-quality segmentation of bone erosion and more consistent erosion scores become available in future studies, the prediction performance should further improve, closing the gap between artificial intelligence methods and the clinical upper limit.

Conclusion

We developed an AI approach for automatically scoring joint space narrowing and bone erosion. Through semantic segmentation of the joint space region as well as integration of multilevel interconnections across joints and damage types, our method achieved high prediction accuracy in quantifying joint space narrowing, approaching the clinical upper limit of this problem.

Acknowledgements

This work is supported by NIH/NIGMS R35GM133346 and NSF/DBI #1452656.

Conflict of interest

The authors declare no competing interests.