Methods:
Study population: We completed a retrospective chart review of
140 C-SCD, aged 6-19 years, followed at the Penn State Pediatric
Comprehensive SCD clinic between 2010 and 2020. PFTs (spirometry, IOS,
plethysmography, and DLCO) are typically obtained annually along with
pertinent laboratory data. We accessed the charts and extracted
demographic characteristics, anthropometric measures, PFT data,
pertinent laboratory results, and measures of clinical outcomes.
Control group: We identified 22 race‐matched children (African
American and Hispanic) without SCD from our patient pool who underwent
DLCO testing between 2018 and 2020, primarily due to dyspnea of unknown origin.
Children with pre-existing cardiovascular, hematological, oncological,
or pulmonary conditions that could affect DLCO were excluded. Since data
on total hemoglobin were unavailable for most control subjects, we
compared DLCO adjusted for alveolar volume
(DLCO/VA) between cases and controls (the rest of the
analyses in C-SCD were performed using hemoglobin-adjusted DLCO, as
described below).
Predictors of adjusted DLCO: DLCO was adjusted for hemoglobin
concentration and age using sex-specific predictive equations and
expressed as a percent of predicted (%pred)19. We
selected the following potential predictors of DLCO: 1) Pulmonary
function test estimates: PFT estimates representing obstructive and
restrictive airway disease were considered as potential predictors of
DLCO. Spirometry data included forced vital capacity (FVC), forced
expiratory volume in 1 second (FEV1), FEV1/FVC, and the forced
expiratory volume between 25% and 75% of FVC (FEV25%-75%). Plethysmography data included total lung capacity
(TLC), vital capacity (VC), residual volume (RV), and RV/TLC. Spirometry
and plethysmography indices were expressed as
%pred (FEV1/FVC and RV/TLC were
expressed as a percent). NHANES III equations were used to calculate
%predicted values. Measures of total airway resistance (R5) and
reactance (X5, Fres, and AX) were obtained from the IOS reports and were
expressed as %pred using
Berdel/Lechtenbörger equations (except AX, which does not have standard
reference values)20. Subjects were instructed not to
take bronchodilator therapy for at least 12 hours prior to the PFTs. 2)
Laboratory values: the degree of anemia and biomarkers of hemolysis
(LDH, total bilirubin, reticulocyte count) are known to be correlated
with SCD-related complications. Systemic diseases, including liver and
renal function abnormalities, are also known to affect DLCO. Neutrophilia
and renal failure have been reported as major predictors of death in
SCD5. Thus, we adjusted the study analyses for SCD
biomarkers, including a complete blood count (CBC) with differential,
fetal hemoglobin (HbF), and lactate dehydrogenase (LDH) levels, along
with liver and renal function test results (e-Table 1).
Indicators of disease severity and clinical outcomes: The number of
ACS episodes has been reported to be associated with the risk of early
death, as early as 10 years of age, in C-SCD5,21. Clinical
severity indicators considered in this study included the lifetime number of
hospitalizations with ACS and VOC, and sleep-related nocturnal hypoxemia
(defined as the percent of total sleep time spent with SpO2
<90%)22. Additionally, tricuspid
regurgitation jet velocity (TRJV) >2.5 m/s, measured by
echocardiography, was considered as a surrogate marker of pulmonary
hypertension23.
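The two threshold-based indicators above can be illustrated with a minimal sketch. It assumes SpO2 was sampled at equal time intervals during sleep, so that the share of samples below 90% approximates the share of sleep time; the function names and the sample readings are hypothetical and are not taken from the study data.

```python
def percent_sleep_time_below(spo2_samples, threshold=90.0):
    """Percent of equally spaced sleep-time SpO2 samples below the threshold."""
    if not spo2_samples:
        raise ValueError("no SpO2 samples provided")
    below = sum(1 for s in spo2_samples if s < threshold)
    return 100.0 * below / len(spo2_samples)


def has_elevated_trjv(trjv_m_per_s, cutoff=2.5):
    """True when TRJV exceeds the cutoff (m/s) used as a surrogate marker."""
    return trjv_m_per_s > cutoff


# Hypothetical overnight SpO2 readings (%), one per equal time interval.
spo2 = [97, 96, 89, 88, 95, 94, 91, 87, 96, 98]
pct_below_90 = percent_sleep_time_below(spo2)
print(pct_below_90)
print(has_elevated_trjv(2.7))
```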
Statistical analyses: We used R (version 3.6.1) and SPSS
(version 26.0) for data analysis. DLCO estimates falling outside three
times the mean Cook’s distance and two standard deviations of Studentized
t-values were considered outliers and were excluded from further
analysis. We compared case and control groups with Mann-Whitney U-tests
and used Pearson correlations to estimate the association between
potential predictors and DLCO. We applied a bootstrap correction to the
Pearson correlations to account for non-normality24.
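As an illustration of bootstrapping a Pearson correlation, the sketch below resamples subject pairs with replacement and reports a percentile confidence interval. The percentile method, the number of resamples, and the example values are assumptions made for illustration; the text does not specify which bootstrap variant was used.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_pearson_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for Pearson's r."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        # Skip degenerate resamples in which one variable is constant.
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue
        stats.append(pearson_r(xs, ys))
    stats.sort()
    low = stats[int((alpha / 2) * n_boot)]
    high = stats[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


# Hypothetical %pred values for one predictor and for DLCO.
predictor = [85, 92, 78, 101, 88, 95, 73, 99, 90, 82, 94, 87]
dlco = [80, 88, 70, 97, 85, 86, 69, 95, 84, 79, 91, 80]

r_hat = pearson_r(predictor, dlco)
ci_low, ci_high = bootstrap_pearson_ci(predictor, dlco)
print(r_hat, ci_low, ci_high)
```

Because the interval is built from the resampled correlations themselves, it does not rely on the bivariate normality assumption behind the usual analytic interval.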
Prediction models: Variables with a statistically significant
association with DLCO were then examined to estimate their relative
strength using both a machine learning (ML) based tool, XGBoost, and a
linear mixed-effects regression model. XGBoost is a gradient-boosted
decision tree algorithm that can be used for regression
and for ranking predictors within a user-built
prediction model25. We hypothesized that the ML tool
would outperform linear regression because it can also capture
non-linear associations. Both models were adjusted for age,
sex, race, and hemoglobin genotype, as these affect pulmonary function in
children with SCD 26,27. Models were also adjusted for
hydroxyurea, which increases HbF and improves clinical outcomes in SCD28, and for asthma medications such as long-acting beta-agonists (LABA) and inhaled corticosteroids (ICS), which
can significantly elevate PFT estimates. Finally, models were
controlled for the diagnosis of asthma (yes vs. no), since asthma is one
of the major comorbidities in C-SCD29. We built the
XGBoost model using five-fold cross-validation (CV).
Subjects were randomly divided into five equal groups; four of the
five groups were used at a time as training data and the remaining
one as test data, and the process was repeated five times so that each
group served once as test data. Based on the results, the predictors of
DLCO were selected, and the algorithm was built. Further details are
provided in e-Appendix 1.
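The five-fold partition described above can be sketched as follows. This is a minimal illustration that assigns whole subjects to folds, so that repeated measurements from one subject never appear in both training and test data; the function name, the subject identifiers, and the random seed are hypothetical.

```python
import random


def subject_folds(subject_ids, k=5, seed=0):
    """Yield (train_ids, test_ids) pairs for k-fold CV split by subject."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    # Deal the shuffled subjects into k folds of (near-)equal size.
    folds = [ids[i::k] for i in range(k)]
    for i in range(k):
        test_ids = folds[i]
        train_ids = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train_ids, test_ids


# Hypothetical subject identifiers.
subjects = ["S%02d" % n for n in range(1, 21)]
splits = list(subject_folds(subjects, k=5, seed=1))
for train_ids, test_ids in splits:
    print(len(train_ids), len(test_ids))
```

Each subject appears in exactly one test fold, so every observation is predicted once by a model that never saw that subject during training.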
Multicollinearity adjustment: We estimated the degree of
multicollinearity between the PFT indices using a linear
regression model that included all indices, with
hemoglobin-adjusted DLCO as the dependent variable. In this analysis,
FEV1 (%) had a high variance inflation factor (VIF) of 5.92 and was
therefore removed from further analyses to minimize multicollinearity
and stabilize the standard error estimates30; the rest
of the predictor variables were included in the final models for both
XGBoost and regression analysis.
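For readers unfamiliar with the VIF, the sketch below computes it from first principles: each predictor is regressed on the remaining predictors, and VIF = 1 / (1 - R²) of that regression. The predictor values are hypothetical, and the cutoff of 5 used in the last line is an assumption for illustration; the text reports only that a VIF of 5.92 was considered high.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(y, columns):
    """R^2 of a least-squares fit of y on the columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(p)]
           for a in range(p)]
    xty = [sum(row[a] * y[i] for i, row in enumerate(design)) for a in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * v for bj, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(predictors):
    """Map each predictor name to its variance inflation factor."""
    result = {}
    for name, col in predictors.items():
        others = [c for k, c in predictors.items() if k != name]
        result[name] = 1.0 / (1.0 - r_squared(col, others))
    return result


# Hypothetical %pred values; FVC and FEV1 are deliberately near-collinear.
pft = {
    "FVC": [85, 92, 78, 101, 88, 95, 73, 99, 90, 82],
    "FEV1": [83, 90, 75, 99, 87, 91, 70, 98, 86, 80],
    "TLC": [95, 88, 102, 91, 99, 85, 97, 93, 90, 104],
}
vifs = vif(pft)
flagged = [name for name, v in vifs.items() if v > 5]  # assumed cutoff
print(vifs, flagged)
```

The VIF depends only on the predictors, not on the dependent variable, which is why the outcome does not appear in the calculation.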
Ranking of the predictors: Predictors were ranked based on
their relative importance determined by “gain” measure in XGBoost and
by p-values in the linear mixed model. To quantify the predictive
accuracy of both models, we calculated the
mean absolute percentage error
(MAPE) and the correlation coefficient between measured and estimated DLCO (eDLCO).
MAPE values <10% and
between 10% and 20% are considered ‘excellent’ and ‘good’ forecasting,
respectively31.
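The MAPE calculation and the interpretation bands quoted above can be written out as a short sketch. The function names and the example values are hypothetical; values above 20% are left unlabelled because the text does not assign them a category.

```python
def mape(measured, estimated):
    """Mean absolute percentage error, in percent, relative to measured values."""
    if not measured or len(measured) != len(estimated):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((m - e) / m) for m, e in zip(measured, estimated))
    return 100.0 * total / len(measured)


def forecast_grade(mape_percent):
    """Label a MAPE value using the bands quoted in the text."""
    if mape_percent < 10:
        return "excellent"
    if mape_percent <= 20:
        return "good"
    return "unlabelled"  # no category is given in the text above 20%


# Hypothetical measured and estimated DLCO (%pred).
measured_dlco = [80, 100, 120]
estimated_dlco = [76, 105, 120]
error = mape(measured_dlco, estimated_dlco)
print(error, forecast_grade(error))
```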
Association between DLCO and clinical outcome measures of SCD: To confirm the prognostic importance of DLCO, we analyzed its
association with SCD clinical outcomes using linear regression adjusted
for age and sex. For the correlation analyses between the lifetime number
of VOC/ACS events and DLCO, we used the median DLCO value for
subjects with multiple data points. We also conducted correlation
analyses between DLCO and other disease severity indicators, including
TRJV and the degree of nocturnal hypoxemia. We first examined measured
DLCO. We then used our prediction models (XGBoost and the mixed-effects
model) to calculate eDLCO and analyzed the association between
eDLCO values and outcome measures using linear regression, to
cross-examine the accuracy and clinical relevance of the prediction
models.
Validation of the prediction model: Leave-one-out performance
(LOOP) cross-validation was used for the model
validation32. Using the ‘LOOP’ function, DLCO was predicted
for each study subject in turn while the remaining data (111 in
this case) were used to train the XGBoost algorithm. This process was
repeated to predict DLCO for all study participants. The strength of
the forecast was estimated with MAPE and the Pearson correlation
coefficient between observed and predicted DLCO.
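The leave-one-out procedure can be sketched as below. To keep the example self-contained, a simple least-squares line stands in for the XGBoost model; this substitution, the function names, and the data values are assumptions for illustration and do not reproduce the study's ‘LOOP’ function.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def leave_one_out_predictions(x, y):
    """Predict each observation from a model trained on all the others."""
    predictions = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        intercept, slope = fit_line(x_train, y_train)  # stand-in for XGBoost
        predictions.append(intercept + slope * x[i])
    return predictions


# Hypothetical predictor and observed DLCO values (%pred).
predictor = [85, 92, 78, 101, 88, 95, 73, 99]
observed = [80, 88, 70, 97, 85, 86, 69, 95]
predicted = leave_one_out_predictions(predictor, observed)

# Mean absolute percentage error of the held-out predictions.
loo_mape = 100.0 * sum(
    abs((o - p) / o) for o, p in zip(observed, predicted)
) / len(observed)
print(predicted, loo_mape)
```

Every prediction comes from a model that never saw the held-out observation, so the resulting error reflects out-of-sample rather than in-sample fit.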