Deep learning-based classification for lung opacities in chest x-ray radiographs through batch control and sensitivity regulation

Imbalanced classification refers to an unequal distribution of data categories in the training dataset of a classification model and is a prevalent problem in data-driven deep-learning models. For example, the COVID-19 outbreak spread worldwide in mere months, affecting the lives and health of countless people. At the beginning of the outbreak, only a limited number of chest X-ray (CXR) images of patients with confirmed COVID-19 were available, and attempts to train a deep-learning CXR model to identify COVID-19 cases on such an unbalanced dataset would impede accurate performance assessment of the obtained model. In this study, we investigated the batch control method (BCM) as a potential solution to imbalanced classification. The main methodological concept is the regulation of CXR model sensitivity by manipulating the class distribution of the data batches used during training.
In the dataset, the ratio of positive to negative cases was unbalanced (approximately 1:4). At the beginning of this study, we implemented the UNet model and trained it using a vanilla approach: we shuffled the dataset into random batches, fed these batches into the UNet model, and calculated the loss function for minibatch gradient-descent optimization. However, the results obtained in this preliminary stage were unsatisfactory and unstable. We then tailored the BCM to address the class imbalance problem. The BCM manipulates the distribution of positive and negative cases in each batch. For a batch size of six, we varied the number of positive cases from six to one (and, correspondingly, the number of negative cases from zero to five) in each batch to create the P100 to P17 models, where the suffix denotes the percentage of positive samples in each batch. The vanilla UNet approach trained with a random data distribution produced the RAND model.
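As a minimal sketch of the batch assembly described above (illustrative, not the study's actual code), the following Python generator yields batches with a fixed positive:negative count. The function name bcm_batches and the epoch convention of recycling the negative pool are our own assumptions.

```python
import numpy as np

def bcm_batches(pos_idx, neg_idx, n_pos, batch_size=6, rng=None):
    """Yield index batches containing exactly n_pos positive samples.

    n_pos=5 with batch_size=6 corresponds to the P83 setting
    (5 positives : 1 negative per batch); n_pos=6 corresponds to P100.
    """
    rng = rng or np.random.default_rng()
    n_neg = batch_size - n_pos
    pos = rng.permutation(pos_idx)
    neg = rng.permutation(neg_idx)
    p = n = 0
    # One pass over the positive pool defines an epoch here (an assumption);
    # the negative pool is reshuffled and recycled whenever it runs out.
    while p + n_pos <= len(pos):
        if n_neg > 0 and n + n_neg > len(neg):
            neg, n = rng.permutation(neg_idx), 0
        batch = np.concatenate([pos[p:p + n_pos], neg[n:n + n_neg]])
        yield rng.permutation(batch)  # shuffle within the batch
        p, n = p + n_pos, n + n_neg
```

For a P83 run, one would iterate `for batch in bcm_batches(pos_ids, neg_ids, n_pos=5)` and feed each batch to the optimizer; the RAND model corresponds to sampling batches without any such constraint.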
On the class-balanced test set (229 positive and 229 negative cases) and across eight trials of the training procedure, the P83 model (F1: 0.78) outperformed the other six models (F1: 0.62–0.77) in terms of F1-score. The remaining metrics were TPR = 0.80, FPR = 0.25, ACC = 0.78, and F1-CV = 0.012 for P83 versus TPR = 0.46, FPR = 0.02, ACC = 0.72, and F1-CV = 0.135 for RAND. The F1-score of 0.62 for the RAND approach (proportion of positive samples: 22%) lay between those of the P33 (0.72) and P17 (0.60) models. These results indicate that the F1-scores were associated with the data distribution of the training batches. In the three models with more negative cases per optimization iteration (P33, P17, and RAND), the networks were exposed to a greater proportion of negative samples and, as anticipated, tended to produce more negative predictions. The TPR and FPR values from P100 to P17 ranged from 0.93 to 0.44 and from 0.77 to 0.01, respectively, further validating the association between data distribution and model performance. The P83, P66, and P50 models achieved higher F1-scores (0.75–0.78) than the other models did (0.60–0.72), indicating that the CXR model performed better when at least half of each training batch comprised positive samples. The BCM thus allows the sensitivity of CXR models to be regulated to meet the requirements of different clinical environments. For example, if the identification of lung opacities in patients is a primary reason for CXR examination, a P83 BCM model may be suitable.
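The reported metrics follow directly from confusion-matrix counts. The snippet below (illustrative, not the study's code) computes them; the example counts are our own back-calculation, chosen to be approximately consistent with the reported P83 values on the 229 + 229 test set.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard classification metrics from confusion-matrix counts."""
    tpr = tp / (tp + fn)                    # sensitivity / recall
    fpr = fp / (fp + tn)                    # 1 - specificity
    acc = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    return tpr, fpr, acc, f1

# Approximate P83 counts (hypothetical): TPR 0.80 -> ~183 of 229 positives
# detected; FPR 0.25 -> ~57 of 229 negatives misclassified.
print(classification_metrics(tp=183, fp=57, tn=172, fn=46))
# -> TPR ~0.80, FPR ~0.25, ACC ~0.78, F1 ~0.78
```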
Across the eight trials of the training–testing procedure, the F1-CV values of the BCM models ranged from 0.011 to 0.043, whereas those of the RAND model ranged from 0.118 to 0.135, demonstrating that the BCM produced more stable results than the RAND method did. The fixed ratio of positive to negative samples yielded a smoother loss function and thus better convergence in the BCM models. We investigated three networks, UNet, SegNet, and PSPNet; most of their performance metrics (e.g., TPR and FPR) were similar, with the exception of F1-CV. The average F1-CV values of the BCM models were 0.0236 (UNet), 0.0363 (SegNet), and 0.0345 (PSPNet), suggesting that UNet was the most stable.
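Assuming F1-CV denotes the coefficient of variation of the F1-scores across the eight trials (sample standard deviation divided by mean), it can be computed as follows; this definition is our reading of the text, not a quoted formula, and the trial values shown are hypothetical.

```python
import numpy as np

def f1_cv(f1_scores):
    """Coefficient of variation (sample std / mean) of F1 across trials."""
    f1 = np.asarray(f1_scores, dtype=float)
    return f1.std(ddof=1) / f1.mean()

# Hypothetical eight-trial F1 values; a tight spread yields a small F1-CV,
# which is how the stability of the BCM models manifests in this metric.
print(f1_cv([0.78, 0.77, 0.78, 0.79, 0.78, 0.77, 0.78, 0.78]))
```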
In this study, a batch size of six was used because of the random-access memory limit of the GPU (11 GB on the GTX 1080 Ti card); therefore, the data distribution of each batch was limited to six variations, P100 to P17. With more GPU RAM, the classification performance might be further improved through finer adjustment of the data distribution. For example, the P83 model (positive:negative, P:N = 5:1) exhibited the best F1-score among the BCM models; with larger batches, P92 (P:N = 11:1) or P90 (P:N = 9:1) models could be produced to further optimize the CXR models. Supplementary Fig. 1 displays our preliminary investigation of different batch sizes (6, 9, 12, and 18) and data ratios (P33, P66, and P100); the sensitivity of the CXR models can also be regulated through the batch size. Increasing the GPU RAM and the number of ratio combinations available in each batch merits further investigation.
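To make the constraint concrete, the simple enumeration below (our own illustration, not from the paper) lists the positive fractions realizable with integer counts per batch; batch sizes of 12 and 10 would admit the hypothetical P92 (11:1) and P90 (9:1) settings mentioned above.

```python
def positive_fractions(batch_size):
    """Map each integer P:N split to its positive-sample percentage."""
    return {f"{k}:{batch_size - k}": round(100 * k / batch_size)
            for k in range(batch_size, 0, -1)}

print(positive_fractions(6))   # 6:0 -> 100 ... 1:5 -> 17 (P100..P17)
print(positive_fractions(12))  # includes 11:1 -> 92, i.e., a P92 model
print(positive_fractions(10))  # includes 9:1 -> 90, i.e., a P90 model
```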
In machine learning, training procedures are generally biased towards the majority class because classifiers aim to minimize a global loss function; the obtained model therefore tends to misclassify the minority class in the dataset. Several approaches to the class imbalance problem exist in the machine learning field. At the data level, resampling methods, such as over-sampling, under-sampling, or SMOTE15,16,17, generate a new dataset with an adjusted data distribution. At the algorithm level, advanced loss functions that incorporate the data distribution into the loss derivation, such as class rectification loss18 and focal loss19, have proven effective for class imbalance problems. The BCM proposed in this study can be considered an implementation of over-sampling: it adjusts the proportion of positive samples in each batch and thereby manipulates the over-sampling ratio during training. Combining the BCM with advanced loss functions may further improve the performance of the CXR models. Supplementary Table 1 lists our preliminary comparison of cross-entropy loss and focal loss; the results suggest that RAND models trained with focal loss markedly improved the CXR classification metrics. Future studies are necessary to validate the efficacy of the combined methods.
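For reference, the binary focal loss19 can be sketched as follows in PyTorch. This is a generic implementation of the published formula FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), not the exact configuration used in our supplementary comparison.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    The (1 - p_t)**gamma factor down-weights easy, well-classified examples,
    so abundant easy negatives no longer dominate the loss.
    """
    targets = targets.float()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)        # prob. of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```

Setting gamma = 0 and alpha = 0.5 recovers a (scaled) standard cross-entropy loss, which makes the comparison in Supplementary Table 1 a matter of the weighting factors alone.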
We used the segmentation network UNet for the classification application; the UNet architecture implements an encoder–decoder network with skip connections. In deep learning-based pattern recognition, image classification can also be achieved with the encoder alone followed by a fully connected output, as in VGG1620, ResNet21, or DenseNet22. In addition, object-detection networks such as Faster Region-based CNN (R-CNN)23, YOLO24, or Mask R-CNN25 can be applied to CXR classification problems. We have not implemented or evaluated the BCM with these network architectures, which is a limitation of this study. Nonetheless, because these architectures are all based on convolutional networks, we expect the BCM to remain advantageous; further investigation of this expectation is warranted.
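To make the architectural distinction concrete, the toy PyTorch sketch below contrasts an encoder–decoder with a skip connection (UNet-style) against an encoder plus fully connected output (VGG/ResNet-style classifier). The module names and channel sizes are illustrative and far smaller than the networks used in this study.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    """Two 3x3 convolutions with ReLU, the basic UNet building block."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    """Two-level encoder-decoder with one skip connection (toy UNet)."""
    def __init__(self, c_in=1, c_out=1):
        super().__init__()
        self.enc1 = conv_block(c_in, 16)
        self.enc2 = conv_block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec1 = conv_block(32, 16)       # 16 (skip) + 16 (upsampled)
        self.head = nn.Conv2d(16, c_out, 1)  # per-pixel opacity map

    def forward(self, x):
        e1 = self.enc1(x)                    # kept for the skip connection
        e2 = self.enc2(self.pool(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))
        return self.head(d1)

class TinyEncoderClassifier(nn.Module):
    """Encoder alone + fully connected output (VGG/ResNet-style)."""
    def __init__(self, c_in=1):
        super().__init__()
        self.enc1, self.enc2 = conv_block(c_in, 16), conv_block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.fc = nn.Linear(32, 1)           # single positive/negative logit

    def forward(self, x):
        h = self.enc2(self.pool(self.enc1(x)))
        return self.fc(h.mean(dim=(2, 3)))   # global average pool -> FC
```

Because the BCM only constrains how batches are assembled, it is agnostic to which of these architectures consumes the batches, which underlies our expectation that it would transfer.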
In conclusion, we presented the deep-learning method employed in the RSNA challenge for CXR recognition. To address the class imbalance of the RSNA dataset, we developed and evaluated the BCM. The models obtained using the BCM were more stable, and their sensitivity was adjustable through manipulation of the distribution of positive and negative cases in each batch. The BCM is therefore a practical method of producing tunable and stable CXR models even when the training dataset is imbalanced. The rapidly increasing number of confirmed COVID-19 infections continues to exert pressure on medical care systems and exhaust medical resources. As medical science researchers, we believe that global collaborative and investigative efforts will assist in overcoming this catastrophe.