Interpretation
The AAGL system accurately predicted surgical complexity level in 66.2%
of cases, which is comparable to the 69.2% found in the original paper
(2). In our study the overall agreement between AAGL stage and AAGL
complexity level was weak, as quantified by a weighted kappa score of
0.38 to 0.42 across the three observers. This was lower than the 0.621
reported in the original study (2), which suggested moderate agreement.
Stage 1 performed reasonably well at predicting skill level A, and this
was consistent across the three observers; however, the remaining stages
2, 3 and 4 did not correlate well with complexity level.
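
For readers less familiar with the agreement statistic, the following is
a minimal sketch of how a weighted kappa between stage and complexity
level can be computed. It assumes linear disagreement weights (the
weighting scheme is not stated in this section); the function name and
the example ratings are hypothetical and are not study data.

```python
def weighted_kappa(a, b, k, power=1):
    """Cohen's weighted kappa for two ratings coded 0..k-1.

    power=1 gives linear weights, power=2 quadratic weights.
    The choice of linear weights here is an assumption.
    """
    n = len(a)
    # Observed joint proportions of (rating a, rating b).
    obs = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        obs[x][y] += 1.0 / n
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    observed = 0.0  # weighted observed disagreement
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power
            observed += w * obs[i][j]
            expected += w * row[i] * col[j]
    if expected == 0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return 1.0 - observed / expected


# Hypothetical ratings: stage 1-4 and level A-D, both coded 0-3.
stages = [0, 0, 1, 1, 2, 2, 3, 3]
levels = [0, 0, 0, 1, 1, 3, 2, 3]
kappa = weighted_kappa(stages, levels, k=4)
```

Because the weights grow with the distance between categories, a stage 2
case rated as level C is penalised less than a stage 1 case rated as
level D.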
The pre-specified AAGL cut-points had reasonably high specificity for
discerning skill level A/B/C versus D (stage 4) but low specificity for
A versus B/C/D and A/B versus C/D (lower levels). When AUCROC data in
this external validation are directly compared with those reported in
the original paper (2), the results are less robust. For A
versus B/C/D, AUCROC in the original paper was 0.98, and in our
analysis, it was lower at 0.75 to 0.89. For A/B versus C/D, AUCROC in
the original paper was 0.95, and in our analysis, it was lower at 0.81.
For A/B/C versus D, AUCROC in the original paper was 0.91, and in our
analysis, it was higher at 0.95 to 0.96. This may reflect the fact that
in the original paper, regression analysis was used to identify optimal
cut points for that particular dataset, so the performance would
therefore be expected to be less promising when externally validated.
Poor diagnostic accuracy for stages 2, 3 and 4 and lower than previously
reported AUCROC results in our dataset suggest that the AAGL staging
tool may not be generalizable in its current form.
While stage 4 had a low PPV for predicting surgical complexity level D
(47.5%), the specificity (91.7%) and NPV (99.57%) were high. This
demonstrates that a stage lower than 4 performs well at ruling out
surgical complexity level D. The AUCROC for stage 4 to discriminate
level D from levels A/B/C was high at 0.95, which confirmed this
finding. These results suggest that the tool might be useful for
surgical planning, although if the stage can only be determined
intraoperatively, the utility of this is limited.
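
The combination of a low PPV with high specificity and NPV is what would
be expected when the target level is uncommon. The sketch below shows
the arithmetic for a 2x2 table of stage 4 (test positive) against level
D (condition present). The function name and the counts are hypothetical
and chosen only for illustration; they are not study data.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard accuracy measures from the four cells of a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),  # level D cases correctly staged 4
        "specificity": tn / (tn + fp),  # level A/B/C cases correctly staged below 4
        "ppv": tp / (tp + fp),          # stage 4 cases that are truly level D
        "npv": tn / (tn + fn),          # cases staged below 4 that are truly A/B/C
    }


# Hypothetical counts with few level D cases (10 of 200).
example = diagnostic_metrics(tp=8, fp=12, fn=2, tn=178)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With few true level D cases, even a small number of false positives
among the many A/B/C cases outweighs the true positives, so PPV falls
while specificity and NPV remain high.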