Figure 4 illustrates the network architecture of the proposed CMD²A-Net. The coarse segmentation module outputs a coarse lesion contour and also enables local feature extraction on lesion regions. Provided with these richer lesion features, the domain transfer module is introduced to facilitate feature alignment, and a classifier module is incorporated for malignancy prediction. CMD²A-Net is trained on the three sequences (i.e. T2, ADC, and hDWI) individually. Based on the model outputs (i.e. lesion malignancy probabilities) of the three sequences, we obtain the final malignancy predictions using ensemble learning. CMD²A-Net has two parallel branches with respect to (w.r.t.) the source and target domains, where two encoders extract features of prostate MR images separately in the two domains. The segmentors of the two domains share the same weights. The source segmentor is optimized by a supervised loss (i.e. the coarse lesion segmentation loss), which requires samples and coarse mask labels from the source domain for training. The segmentation loss can be defined as
$$\mathcal{L}_{seg} = 1 - \frac{2\sum_{i=1}^{w}\sum_{j=1}^{h} m_{ij}\,\hat{m}_{ij} + \epsilon}{\sum_{i=1}^{w}\sum_{j=1}^{h} m_{ij} + \sum_{i=1}^{w}\sum_{j=1}^{h} \hat{m}_{ij} + \epsilon}, \quad (1)$$

where $m_{ij}$ and $\hat{m}_{ij}$ indicate the pixel element values of the mask label $M$ and the predicted lesion map $\hat{M}$, respectively. Indices $i$ and $j$ denote the column and row of the image matrix of dimension $w \times h$. The constant $\epsilon$ (set to $10^{-5}$) is applied to avoid the zero-denominator case and to guarantee numerical stability.
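As a concrete illustration, a minimal TensorFlow/Keras sketch of this Dice-style segmentation loss might look as follows; the function name, tensor shapes, and batch reduction are our own illustrative assumptions, not the authors' released code:

```python
import tensorflow as tf

def dice_segmentation_loss(y_true, y_pred, eps=1e-5):
    """Soft Dice loss over a batch of w x h lesion masks (Eq. 1).

    y_true: ground-truth coarse masks M, shape (batch, w, h)
    y_pred: predicted lesion maps M-hat in [0, 1], same shape
    eps: small constant guarding against a zero denominator
    """
    intersection = tf.reduce_sum(y_true * y_pred, axis=[1, 2])
    totals = tf.reduce_sum(y_true, axis=[1, 2]) + tf.reduce_sum(y_pred, axis=[1, 2])
    dice = (2.0 * intersection + eps) / (totals + eps)
    return 1.0 - tf.reduce_mean(dice)  # average over the batch
```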

4.2. Attention-based Malignancy Estimation

In recent studies of prostate lesion classification (e.g., Guan, Liu, Yang, Yap, Shen and Liu [28]), lesion identification was suggested to be highly associated with disease-related regions in MR images. Instead of treating all pixels in the entire MR slice equally, an attention mechanism can be introduced to specifically extract lesion features. With this insight, we hypothesize that incorporating prior knowledge of lesion regions into the DA process could enhance the model's classification performance. As illustrated in Figure 4, the two branches follow the same pipeline to generate attention feature maps. In each branch, the attention map is produced from the prostate region and the coarse lesion mask, enabling our model to focus on the lesion region and extract more lesion representations. The prostate region and the coarse lesion mask are denoted as $P$ and $M$, respectively. Note that the subscripts "s" and "t" of variables (e.g., $P_s$ and $P_t$) in Figure 4 represent the source and target domains, respectively. The attention maps of the source and target domains, $A_s$ and $A_t$, respectively, can be calculated by:
$$A_s = P_s \odot \sigma(M_s), \qquad A_t = P_t \odot \sigma(M_t), \quad (2)$$

where the operation $\odot$ denotes the element-wise product, and $\sigma(\cdot)$ denotes the sigmoid function, which is adopted as the nonlinear activation to generate the attention maps. Such a simple but effective function constrains each element of the feature maps to $[0, 1]$, thus weighting the importance of regions. As a result, guided by the coarse mask labels, the lesion areas would be assigned higher weights than the non-informative background (i.e. healthy tissue) in the feature maps.
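A minimal sketch of this attention computation, assuming 4-D feature tensors and TensorFlow broadcasting (illustrative assumptions on our part):

```python
import tensorflow as tf

def attention_map(prostate_region, coarse_mask):
    """Compute A = P ⊙ σ(M) of Eq. (2): sigmoid-weight the prostate
    features by the coarse lesion mask so lesion pixels approach weight 1.

    prostate_region: prostate region/features P, e.g. (batch, w, h, c)
    coarse_mask: coarse lesion mask logits M, broadcastable to P
    """
    return prostate_region * tf.sigmoid(coarse_mask)
```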
To achieve accurate lesion classification, features from the lesion attention maps can be extracted by an encoder, such that high-level lesion features are captured for the classifier module. Thus, in each branch, an encoder is incorporated after the segmentor to extract each domain's specific features. In addition, we propose to fuse the lesion features with the prostate features to boost classification accuracy: skip connection and concatenation operations are introduced to reuse prostate features from the segmentors, as sketched below.
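As a rough sketch of this fusion step in the Keras functional style (the channel width and the 1×1 convolution for channel mixing are our own assumptions, not the published architecture):

```python
from tensorflow.keras import layers

def fuse_features(prostate_feats, lesion_feats):
    """Fuse segmentor (prostate) features, reused via a skip connection,
    with the encoder's lesion features by channel-wise concatenation."""
    fused = layers.Concatenate(axis=-1)([prostate_feats, lesion_feats])
    # 1x1 convolution to mix the concatenated channels (width is assumed)
    return layers.Conv2D(256, 1, activation="relu")(fused)
```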
We design a domain transfer module (Figure 4) that requires no target labels in the training process. The semantic features from both the prostate region and the attention map are fused, such that deep CORAL features from the fully connected (FC) layers can be captured for feature alignment. The Deep CORAL loss [25] is employed to minimize the cross-domain feature distribution discrepancy, owing to its generality, transferability, and ease of implementation. It is defined as the difference of second-order covariances between domains. Our domain transfer loss is defined as:
$$\mathcal{L}_{DT} = \sum_{k=1}^{K} \frac{\alpha_k}{4 d_k^2} \left\| C_S^{(k)} - C_T^{(k)} \right\|_F^2, \quad (3)$$

where $K$ indicates the number of FC layers. The constants $\alpha_k$ are weights that balance the contributions of the FC layers, all set to 1 here. The squared matrix Frobenius norm is denoted as $\|\cdot\|_F^2$, and $d_k$ is the dimension of the $k$-th FC layer. The feature covariance matrices of the source and target domains, $C_S^{(k)}$ and $C_T^{(k)}$, respectively, can be calculated by:
$$C_S^{(k)} = \frac{1}{n_S - 1}\left( D_S^\top D_S - \frac{1}{n_S}\left(\mathbf{1}^\top D_S\right)^\top \left(\mathbf{1}^\top D_S\right) \right), \qquad C_T^{(k)} = \frac{1}{n_T - 1}\left( D_T^\top D_T - \frac{1}{n_T}\left(\mathbf{1}^\top D_T\right)^\top \left(\mathbf{1}^\top D_T\right) \right), \quad (4)$$

where $n_S$ and $n_T$ denote the number of images in the corresponding domain, $D_S$ and $D_T$ indicate the feature matrices of the corresponding FC layer, and $\mathbf{1}$ is a column vector with all elements equal to 1.
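A compact TensorFlow sketch of Eqs. (3)–(4) for a single FC layer (function and variable names are our own; the released code may differ):

```python
import tensorflow as tf

def coral_loss(source_feats, target_feats):
    """Deep CORAL loss for one FC layer (Eqs. 3-4 with alpha_k = 1):
    squared Frobenius distance between feature covariance matrices.

    source_feats: (n_S, d) FC activations on source samples
    target_feats: (n_T, d) FC activations on target samples
    """
    d = tf.cast(tf.shape(source_feats)[1], tf.float32)

    def covariance(feats):
        n = tf.cast(tf.shape(feats)[0], tf.float32)
        col_sum = tf.reduce_sum(feats, axis=0, keepdims=True)  # 1^T D, shape (1, d)
        return (tf.matmul(feats, feats, transpose_a=True)
                - tf.matmul(col_sum, col_sum, transpose_a=True) / n) / (n - 1.0)

    diff = covariance(source_feats) - covariance(target_feats)
    return tf.reduce_sum(tf.square(diff)) / (4.0 * d * d)
```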
To accomplish malignancy prediction using mpMRI, an ensemble learning approach is employed to fuse the predictions of the three separate models (w.r.t. T2, ADC, and hDWI). We train the classifier module, as in Figure 4, using labeled source data. The FC layers in the source domain are employed not only for cross-domain feature alignment but also for malignancy classification. The cross-entropy loss is utilized to optimize the classifier module. Our classification loss can be defined as:
$$\mathcal{L}_{cls} = -\frac{1}{n_S}\sum_{i=1}^{n_S}\left[ y_i \log \hat{y}_i + (1 - y_i)\log\left(1 - \hat{y}_i\right) \right], \quad (5)$$

where $y_i$ and $\hat{y}_i$ denote the ground truth and the malignancy prediction w.r.t. each source sample, respectively.
The ultimate purpose of CMD²A-Net is to accomplish accurate PLDC. To this end, we simultaneously train the coarse segmentation module, the domain transfer module, and the classifier module. Note that minimizing the segmentation loss alone would cause overfitting to the source domain, while optimizing only the domain transfer loss would degrade generalization in the target domain. Therefore, joint optimization of the total loss facilitates the training process to reach an equilibrium, such that domain-invariant features can be extracted to achieve accurate classification. The total loss is defined as:
$$\mathcal{L}_{total} = \mathcal{L}_{cls} + \lambda_1 \mathcal{L}_{seg} + \lambda_2 \mathcal{L}_{DT}, \quad (6)$$

where $\lambda_1$ and $\lambda_2$ are the weighting hyperparameters of the total loss; both were set to 0.5 in our experiments.
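To make the joint optimization concrete, here is a hedged sketch of one training step that combines Eqs. (1), (3), and (6), reusing the loss sketches above; the assumed model signature (returning the predicted mask, the source and target FC features, and the malignancy prediction) is our own simplification of CMD²A-Net, not its released interface:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)  # rate from Section 4.3

def train_step(model, src_images, src_masks, src_labels, tgt_images,
               lambda_1=0.5, lambda_2=0.5):
    """One joint optimization step over the total loss of Eq. (6)."""
    with tf.GradientTape() as tape:
        pred_mask, src_fc, tgt_fc, pred_label = model(
            [src_images, tgt_images], training=True)
        l_seg = dice_segmentation_loss(src_masks, pred_mask)   # Eq. (1)
        l_dt = coral_loss(src_fc, tgt_fc)                      # Eqs. (3)-(4)
        l_cls = tf.reduce_mean(
            tf.keras.losses.binary_crossentropy(src_labels, pred_label))  # Eq. (5)
        loss = l_cls + lambda_1 * l_seg + lambda_2 * l_dt      # Eq. (6)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```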
To leverage the benefits of multiple sequences, we utilize a weighted-average ensemble learning method. The outputs of the three separate models are combined to produce the final ensemble prediction as follows:
$$\hat{p} = \frac{p_{T2} + \beta_{ADC}\, p_{ADC} + \beta_{hDWI}\, p_{hDWI}}{1 + \beta_{ADC} + \beta_{hDWI}}, \quad (7)$$

where $p_{T2}$, $p_{ADC}$, and $p_{hDWI}$ are the malignancy probability predictions of T2, ADC, and hDWI, for which the weights are 1, $\beta_{ADC}$, and $\beta_{hDWI}$, respectively. The binary variables $\beta_{ADC}, \beta_{hDWI} \in \{0, 1\}$ are assigned based on the availability of ADC and hDWI. For example, if the samples include ADC but not hDWI, then $\beta_{ADC} = 1$ and $\beta_{hDWI} = 0$.
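A small Python sketch of this availability-aware weighted average (the function name and calling convention are illustrative):

```python
def ensemble_prediction(p_t2, p_adc=None, p_hdwi=None):
    """Weighted-average ensemble of Eq. (7). T2 always contributes with
    weight 1; the binary betas reflect ADC/hDWI availability."""
    beta_adc, p_adc = (1.0, p_adc) if p_adc is not None else (0.0, 0.0)
    beta_hdwi, p_hdwi = (1.0, p_hdwi) if p_hdwi is not None else (0.0, 0.0)
    return (p_t2 + beta_adc * p_adc + beta_hdwi * p_hdwi) / (1.0 + beta_adc + beta_hdwi)
```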

4.3. Implementation Details

Our models (i.e. the Mask R-CNN model, CM-Net, and CMD²A-Net) were trained on a GeForce GTX 1080 Ti GPU (Nvidia, California, USA) with the Keras API [43]. For Mask R-CNN training, data augmentation with random rotation was applied to the 646 T2 image slices from I2CVB. All slices were split into training, validation, and testing sets in the ratio of 7:2:1. The input shape of Mask R-CNN was set to 512 × 512 pixels. The Adam optimizer was applied with a learning rate of $10^{-3}$. The batch size was set to 4 and the total number of epochs was 200. During training, the model with the highest Dice coefficient on the validation set was retained. For CM-Net and CMD²A-Net training, the prostate regions from P-x, LC-A, and LC-B were scaled to 224 × 224 pixels. Random rotation of {±3°, ±6°, ±9°, ±12°, ±15°} was applied for data augmentation. The Adam optimizer was chosen with a learning rate of $10^{-5}$, and the batch size was set to 2. In the training of CM-Net, due to the limited sample size, all slices were split into training and testing sets in the ratio of 4:1 using the hold-out method. The segmentation loss was optimized first to accelerate model convergence, and CM-Net with the pre-trained coarse segmentation module was then trained further. For CMD²A-Net, we first initialized both of its branches using the weights of the pre-trained CM-Net to facilitate convergence. Specifically, we trained both the coarse segmentation module and the classifier of CM-Net first, with combined samples from both domains. Then, we optimized the total loss of CMD²A-Net with labeled source samples and unlabeled target samples. By co-training all the modules, the model with the highest accuracy was saved for malignancy evaluation in the target domain.
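As an illustration of the discrete rotation augmentation described above, a minimal SciPy-based sketch (the helper name and interpolation mode are our own assumptions):

```python
import random
from scipy.ndimage import rotate

# Discrete rotation angles reported above: {±3°, ±6°, ±9°, ±12°, ±15°}
ANGLES = [a for d in (3, 6, 9, 12, 15) for a in (d, -d)]

def augment_slice(slice_2d):
    """Randomly rotate a 224 x 224 prostate-region slice by one of the
    reported discrete angles; edge handling is an illustrative choice."""
    return rotate(slice_2d, angle=random.choice(ANGLES),
                  reshape=False, mode="nearest")
```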
We also make our executable code and files available online via GitHub, so as to allow extension or application of this work by others. This open-source deep-learning model acts as an end-to-end system: it takes prostate mpMRI sequences (i.e. T2, ADC, and hDWI) as input and outputs prediction results (i.e. prostate segmentation, coarse lesion detection, and malignancy estimation). The system supports multi-format inputs, including DICOM, JPEG (.jpeg/.jpg), and PNG files. We emphasize that no manual prostate segmentation or annotation is required.
