Experimental Section/Methods
Data collection
In this study, we used radiographic images from two clinical studies, the Consortium for the Longitudinal Evaluation of African-Americans with Rheumatoid Arthritis (CLEAR) [39] and the Treatment Efficacy and Toxicity in Rheumatoid Arthritis Database and Repository (TETRAD) [40]. A total of 1472 radiographic images from 368 sets were used to develop models. Each set consisted of two images of hands and two images of feet from the same individual. Two types of joint damage were investigated: joint space narrowing and bone erosion. The ground truth label is the Sharp/van der Heijde (SvH) score generated by human experts through manual inspection of images [41]. The SvH scoring system examines the joints typically targeted by RA, including multiple joints in the wrists, the proximal interphalangeal (PIP) and metacarpophalangeal (MCP) joints of the fingers, and the PIP and metatarsophalangeal (MTP) joints of the toes. For joint space narrowing, 15 joints from each hand and 6 joints from each foot were assessed, with the score ranging from 0 to 4. A higher score represents a higher degree of damage, where 0 is normal, 1 is focal narrowing, 2 is a reduction of less than 50% of the original joint space, 3 is a reduction of more than 50% of the original joint space, and 4 is complete dislocation. For bone erosion, 16 joints from each hand and 6 joints from each foot were assessed, with the score ranging from 0 to 5. Similarly, a score of 0 represents no damage, 1 is discrete erosion, 2-3 are large erosions that involve the bone surface, 4 is erosion that extends over the middle of the bone, and 5 is complete collapse. To estimate a clinical upper limit on scoring performance, each joint was scored independently by two trained professionals. Pearson's correlation and root mean square error (RMSE) between the two scorers were calculated as this upper limit.
Location of joints and segmentation of joint space regions
For each joint of interest, we first manually labeled the coordinates of its center. Based on the center coordinates, we then generated a 30-by-30-pixel square mask as the ground truth. These location masks were used to build models that locate a joint. In addition, we manually outlined the joint space regions with polygons. These polygons served as the segmentation masks of the joint space regions and were used to build models that accept image patches as input.
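The construction of a location mask from a labeled center can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `square_mask`, the (row, column) coordinate convention, and the clipping of the square at image borders are all assumptions.

```python
def square_mask(height, width, center_row, center_col, size=30):
    """Return a height x width binary mask (list of lists) containing a
    size x size square of ones placed around (center_row, center_col).

    With an even size the square spans center - size//2 up to
    center + size//2 - 1 along each axis. Clipping at the image border
    is an assumption made for this sketch.
    """
    half = size // 2
    # Half-open bounds [r0, r1) x [c0, c1), clipped to the image.
    r0 = max(center_row - half, 0)
    r1 = min(center_row - half + size, height)
    c0 = max(center_col - half, 0)
    c1 = min(center_col - half + size, width)

    mask = [[0] * width for _ in range(height)]
    for r in range(r0, r1):
        for c in range(c0, c1):
            mask[r][c] = 1
    return mask


# Example: a joint center at row 50, column 60 of a 100 x 120 image.
mask = square_mask(100, 120, 50, 60)
```

For a center far enough from the border, the mask contains exactly 30 × 30 = 900 foreground pixels.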
Evaluation and experimental design
Two evaluation metrics were considered to assess the performance of predicting joint damages. The first metric is Pearson’s correlation coefficient (r) defined as follows:
\(r=\frac{1}{n-1}\sum_{i=1}^n\frac{\left(x_i-\overline{x}\right)\left(y_i-\overline{y}\right)}{S_x\cdot S_y}\)
where \(n\) is the number of joint damages to be scored, \(x_i\) is the prediction, \(y_i\) is the ground truth label created by the human scorer, \(\overline{x}\) and \(\overline{y}\) are the corresponding averages, and \(S_x\) and \(S_y\) are the standard deviations. The second metric is RMSE, defined as follows:
\(RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^n\left(x_i-y_i\right)^2}\)
where \(n\) is the number of joint damages to be scored, \(x_i\) is the prediction, and \(y_i\) is the ground truth label. To evaluate the performance of a model, we performed 10-fold cross-validation experiments, where 90 percent of the data were used to build models and 10 percent of the data were held out for testing.
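The two metrics and the 10-fold split described above can be sketched in plain Python. The function names are hypothetical, and the interleaved assignment of items to folds is an assumption; the text does not state how items were assigned to folds or whether the split was made per image set.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation using the 1/(n-1) normalization,
    with sample standard deviations S_x and S_y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    return cov / (s_x * s_y)


def rmse(x, y):
    """Root mean square error between predictions x and labels y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists for k-fold cross-validation.

    Items are assigned to folds in an interleaved fashion (an assumption);
    each item is held out exactly once.
    """
    indices = list(range(n_items))
    for fold in range(k):
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


# Example with hypothetical scores for five joints.
predicted = [0.2, 1.1, 2.8, 3.9, 0.0]
labeled = [0, 1, 3, 4, 0]
r = pearson_r(predicted, labeled)
error = rmse(predicted, labeled)
folds = list(k_fold_indices(368, k=10))  # e.g. 368 image sets
```

With 368 items and 10 folds, each held-out fold contains 36 or 37 items, which matches the roughly 90/10 split described in the text.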
In addition to predicting joint damage, we also evaluated the accuracy of locating a joint. Since image sizes varied across individuals, we normalized the distance measurement during evaluation. Specifically, the coordinates of a point were first scaled to continuous values between 0 and 1 by dividing them by the height and width of the image. Then we calculated the distance between predictions and ground truth labels. Because this normalized distance was used instead of pixel distances, results from images of different sizes were comparable.
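A minimal sketch of this normalization is given below. The function name and the (row, column) convention are hypothetical, and the use of Euclidean distance is an assumption, since the text does not name the distance measure.

```python
import math


def normalized_distance(pred, truth, height, width):
    """Distance between two points after scaling coordinates to [0, 1].

    pred and truth are (row, col) pixel coordinates in an image of the
    given height and width. Rows are divided by the height and columns
    by the width; Euclidean distance is an assumption of this sketch.
    """
    d_row = pred[0] / height - truth[0] / height
    d_col = pred[1] / width - truth[1] / width
    return math.hypot(d_row, d_col)


# The same relative offset yields the same normalized distance
# in a 100 x 200 image and in a 200 x 400 image.
d_small = normalized_distance((10, 20), (20, 40), height=100, width=200)
d_large = normalized_distance((20, 40), (40, 80), height=200, width=400)
```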