The results of separate models from P-x, LC-A, and LC-B are shown in Table 2. For the three sequences (i.e., T2, ADC, and hDWI), the
AUCs of the three separate models are relatively high when tested within
their own domains, but drop sharply when the models are tested directly in
the unseen domains. These results reveal a noticeable cross-domain discrepancy
(i.e., domain shift) among the four datasets. Note that, in terms of the
T2 sequence, separate models of LC-A and LC-B accomplish the highest
testing AUCs (0.66 and 0.67) in the unseen domain, LC-C, just marginally
higher than those (0.61) within their corresponding domains. A
potential reason for these biased predictions is the small number of testing
samples (only 29) in LC-C. The joint models in the table bring no
remarkable improvement for any sequence compared with the separate models;
instead, they may even degrade performance because of cross-site
heterogeneity.
Given the severe discrepancies among our datasets, we examined
whether rigorous MR image preprocessing methods can improve
the joint models’ classification performance. Like scaled,
whitening is a common preprocessing method, which normalizes
the pixel values to zero mean and unit variance. We took the
combined dataset of P-x and LC-A as a representative for evaluation. In
Table 3, scaled, whitening, and their combinations with
bias field correction (BFC) or noise filtering (NF), 6 preprocessing
methods in total, were adopted as in [35]. The
joint models using scaled and whitening served as the two baselines for
comparison with the rigorous MR image preprocessing methods (i.e., BFC
and NF). Figure 1 depicts image preprocessing examples for
three methods (i.e. whitening, whitening + BFC, and whitening + NF). The
left and right halves of each sample represent before and after
preprocessing, respectively. Before preprocessing, noticeable
discrepancies in intensity distribution can be observed between the samples.
The samples from LC-A contain more low-intensity grayscale pixels than
the images of P-x. Jet color maps were then employed to highlight the
intensity distributions of the two domains after preprocessing. All the
color maps share the same
intensity color scale. Similar intensity distributions can be found
among the samples after preprocessing, demonstrating the effectiveness
of the methods in image distribution harmonization.
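As a rough illustration of the two baseline intensity normalizations, the following Python sketch applies whitening (zero mean and unit variance, as described above) and a min-max rescaling to synthetic arrays. The min-max reading of scaled, the function names, and the synthetic "sites" are assumptions made for illustration only and are not taken from the implementation used in this study; BFC and NF are not shown.

```python
import numpy as np


def scale_intensity(image, eps=1e-8):
    """Min-max rescaling to [0, 1] (an assumed reading of the 'scaled' method)."""
    image = np.asarray(image, dtype=np.float64)
    lo, hi = image.min(), image.max()
    # eps guards against division by zero for constant images.
    return (image - lo) / max(hi - lo, eps)


def whiten_intensity(image, eps=1e-8):
    """Whitening: subtract the mean and divide by the standard deviation."""
    image = np.asarray(image, dtype=np.float64)
    return (image - image.mean()) / max(image.std(), eps)


# Two synthetic slices with different intensity statistics stand in for
# images acquired at two sites (hypothetical data, not from any cohort).
rng = np.random.default_rng(0)
site_a = rng.normal(loc=300.0, scale=50.0, size=(64, 64))
site_b = rng.normal(loc=80.0, scale=10.0, size=(64, 64))

for name, img in [("site_a", site_a), ("site_b", site_b)]:
    w = whiten_intensity(img)
    s = scale_intensity(img)
    # After whitening, both sites share mean ~0 and std ~1;
    # after scaling, both share the range [0, 1].
    print(name, abs(w.mean()) < 1e-9, abs(w.std() - 1.0) < 1e-9, s.min(), s.max())
```

Both functions operate per image, so they align first- and second-order intensity statistics (or the intensity range) across sites but leave spatial artifacts such as bias fields untouched.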
In Table 3, for the T2 sequence, BFC with either scaled or
whitening outperforms the baselines. Moreover, BFC with whitening
achieves the best AUCs of 0.91 and 0.80 on P-x and LC-A, respectively.
However, these findings are not consistent with the results in ADC and
hDWI. In terms of ADC, the models preprocessed with BFC or NF
underperform the baselines. Instead, the baseline models receive the
highest AUCs, where scaled alone and whitening alone accomplish 0.73 and
0.72 on P-x and LC-A, respectively. For the hDWI sequence, BFC and NF
each bring only limited improvement over the
baselines. On P-x, the AUC increases marginally from 0.73 (scaled only)
to 0.80 (scaled with NF); on LC-A, an AUC of only 0.65 is achieved using
scaled with BFC. The above results for the three sequences show that
these preprocessing approaches could improve CM-Net’s classification
performance when combining our two datasets. However, none of the methods
boosts the joint models’ generalization considerably
compared with the separate models of P-x and LC-A (in Table 2).
This indicates that the preprocessing methods are probably insufficient
to solve domain shift fundamentally. A possible reason is that the
severe discrepancies stem from the inter-site discrepancies (in
Table 1), rather than from the intensity distribution of the
heterogeneous mpMRI sequences only (see details in Supplementary
Figure 2).
2.3. Cross-domain Malignancy Classification and Lesion Detection
We emphasize the importance of knowledge transfer from a large-scale
public dataset to a small-scale target domain. The malignancy
estimation performance of CMD²A-Net (the architecture is shown in
Figure 4 and described in detail in the Methods section) is
evaluated. The P-x dataset is regarded only as a source domain. Either
LC-A or LC-B is also set as a source domain for knowledge transfer
between local cohorts. The scaled method was employed for image
preprocessing. In general, the available types of MR sequences may vary across
healthcare institutions. Thus, we employed ensemble learning to handle
multiple sequences, allowing the use of single and multiple sequence(s)
in our framework. Three common metrics were adopted for classification
performance evaluation, i.e. AUC, sensitivity (SEN), and specificity
(SPE).
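To make the evaluation concrete, the sketch below fuses per-sequence malignancy probabilities and computes AUC, SEN, and SPE on toy data. It is a minimal illustration under stated assumptions: the text does not specify the ensemble rule, so simple averaging over the available sequences is assumed here, and the 0.5 decision threshold, the function names, and all numbers are hypothetical.

```python
def ensemble_mean(prob_by_sequence):
    """Average malignancy probabilities over the sequences that are available.

    Averaging is an assumed fusion rule; missing sequences are passed as None.
    """
    available = [p for p in prob_by_sequence.values() if p is not None]
    if not available:
        raise ValueError("at least one sequence is required")
    n = len(available[0])
    return [sum(p[i] for p in available) / len(available) for i in range(n)]


def auc(labels, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """SEN = TP / (TP + FN), SPE = TN / (TN + FP) at an assumed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Toy example: hDWI is unavailable at this hypothetical institution.
labels = [1, 1, 1, 0, 0, 0]
probs = {
    "T2": [0.875, 0.625, 0.375, 0.5, 0.25, 0.125],
    "ADC": [0.75, 0.875, 0.125, 0.25, 0.375, 0.125],
    "hDWI": None,
}
fused = ensemble_mean(probs)
sen, spe = sensitivity_specificity(labels, fused)
print(fused, auc(labels, fused), sen, spe)
```

Because the fusion step skips sequences passed as None, the same code path handles both single-sequence and multi-sequence inputs.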
Table 4. Malignancy classification results in the target
domains for four source-target domain combinations.