Deep neural network trained on gigapixel images improves lymph node metastasis detection in clinical settings

AI-assisted LN assessment workflow
The clinical workflow interface is displayed in Fig. 1. Upon the import of a WSI, the LN detector is triggered to outline the LNs, after which the LN metastasis identification module classifies each LN as positive or negative and highlights the lesion area. To correct false predictions, pathologists can edit contours and contour labels or amend the final counts. To assist pathologists with N-category assessment, a panel summarizes the numbers of positive and negative LNs in the current slide and study. The evaluation of the ESCNN is presented first, followed by the proposed weakly supervised end-to-end training method for metastasis identification and the clinical evaluation of the AI-assisted workflow. A demo video is provided as Supplementary Movie 1.
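The steps of this inference workflow can be summarized in a short sketch. The helper names (ln_detector, metastasis_classifier) and their interfaces are hypothetical placeholders for the trained models described below, not the authors' implementation.

```python
# Minimal sketch of the AI-assisted assessment pipeline: detect LNs on a WSI,
# score each LN, threshold the scores, and summarize the counts shown to the
# pathologist. All interfaces here are hypothetical.
from dataclasses import dataclass

@dataclass
class LNResult:
    contour: list      # polygon outlining the detected LN (editable by the pathologist)
    score: float       # metastasis prediction score from the classifier
    positive: bool     # thresholded decision highlighted in the viewer

def assess_slide(wsi, ln_detector, metastasis_classifier, threshold=0.15):
    results = []
    for contour, ln_image in ln_detector(wsi):              # outline each LN
        score = metastasis_classifier(ln_image)              # per-LN prediction score
        results.append(LNResult(contour, score, score >= threshold))
    summary = {"positive": sum(r.positive for r in results),
               "negative": sum(not r.positive for r in results)}
    return results, summary                                  # pathologists may edit these
```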
a Description of the data sets in this study, including the number (P: positive, N: negative) of studies, slides, and LNs; the distributions of age, sex (F: female, M: male), and the Lauren classification (IT: intestinal type, MT: mixed type, DT: diffuse type), the grading of gastric cancer, and AJCC T and N categories. b Pipeline for training the LN detector and the metastasis identification module. c The inference pipeline leverages the trained models to provide prediagnostic predictions of positive and negative LN counts and highlights suspicious areas. d Schematic of the workflow.
ESCNN performance in metastasis identification
The experiments were conducted using the main training set, which consisted of 983 WSIs comprising 5907 LN images collected from Linkou CGMH in 2019. Each LN image was downscaled to 20× magnification (0.46 µm/pixel) and padded to 75,000 × 75,000 pixels. The metastasis identification model, based on the ResNet50 architecture26, was trained with the ESCNN in an end-to-end, weakly supervised manner. The model was then tested on the main test set of 1156 LN images (positive: 295; negative: 861) collected from Linkou CGMH in 2019. The ground truth of each LN image was reviewed by four pathologists (S.-C.H., J.L., H.-C.C., and T.-Y.H.) and meticulously examined by the most experienced pathologist (S.-C.H., an expert in gastric cancer pathology) with the assistance of immunohistochemistry (IHC) testing. The model achieved an area under the receiver operating characteristic curve (AUC) of 0.9831 (0.9728–0.9934) for the classification of LN images. After the LN prediction scores were aggregated by taking their maximum per slide, the slide-level AUC reached 0.9936 (0.9856–1.0000), comparable to the slide-level AUC of 0.986 achieved by a patch-based model trained with 700 fully annotated WSIs15, demonstrating the effectiveness of weak supervision despite the substantially lower annotation effort. To investigate the impact of lesion size on model performance, two subsets of the main test set were established. Both included all 861 negative LN images; one added the 58 positive LN images containing only micrometastases (≥0.2 mm, <2 mm), and the other added the 28 positive LN images containing only ITCs (<0.2 mm). The model achieved AUCs of 0.9940 (0.9892–0.9988) and 0.9228 (0.8643–0.9814) on the micrometastasis and ITC test subsets, respectively, indicating that ITC identification accuracy still left room for improvement (Fig. 2a).
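The slide-level aggregation described above (taking the maximum LN score per slide) can be illustrated with a brief sketch; the scores and labels below are hypothetical, and scikit-learn is used only for illustration.

```python
# Sketch of LN-to-slide score aggregation and AUC evaluation: a slide's score
# is the maximum of its per-LN prediction scores.
import numpy as np
from sklearn.metrics import roc_auc_score

def slide_level_scores(ln_scores_per_slide):
    """Aggregate per-LN scores into one score per slide via max pooling."""
    return np.array([max(scores) for scores in ln_scores_per_slide])

# Hypothetical example with three slides (two LNs, three LNs, one LN).
ln_scores = [[0.02, 0.91], [0.10, 0.05, 0.08], [0.76]]
slide_labels = [1, 0, 1]                      # 1 = metastatic slide, 0 = negative slide
print(roc_auc_score(slide_labels, slide_level_scores(ln_scores)))
```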

The rows present a comparison of (a) the performance of our model with those of three pathologists before and after receiving AI assistance on the 48 equivocal cases; (b) weakly supervised methods; (c) performance obtained under various magnification levels (20×: 0.46 µm/pixel, 10×: 0.92 µm/pixel, 5×: 1.84 µm/pixel); and (d) performance obtained under different label types and amounts of training data. We evaluated each method with the main test set and its subsets to retrieve the ROC curves. The first column presents the ROC curves differentiating between the 1156 LNs in the main test set. Among the 1156 LNs, those marked as micrometastases and ITCs, as well as all the negative LNs, were sampled to evaluate the performance of the model in identifying micrometastases and ITCs, as presented in the second and third columns. The fourth column displays the slide-level performance.
To prepare the model for practical use, a threshold on the prediction scores was set so that the model could generate a concrete prediction of whether an LN was positive or negative. In the general context, the threshold was set to 0.4 to balance positive and negative predictions; this value yielded a higher Matthews correlation coefficient (MCC; a reliable confusion-matrix metric27) on the validation set than other thresholds. Under a threshold of 0.4, the model achieved a sensitivity of 0.8915 (0.8503–0.9246), a specificity of 0.9861 (0.9758–0.9928), and an MCC of 0.8986 (0.8686–0.9269) on the main test set. These results are comparable to those of patch-based methods14,15 (MCCs: 0.8937 and 0.9334). Table 1 and Supplementary Table 1 present the performance of our model, the pathologists, and previous models14,15 on the main test set as well as the micrometastasis and ITC test subsets. In the clinical context, by contrast, where AI is used to screen suspicious LNs, a more sensitive threshold of 0.15 was employed.
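As an illustration of this operating-point selection, the following sketch sweeps candidate thresholds on a validation set and keeps the one maximizing the MCC; the arrays and candidate grid are hypothetical.

```python
# Sketch of threshold selection: evaluate the MCC on the validation set for a
# grid of candidate thresholds and keep the best-scoring one.
import numpy as np
from sklearn.metrics import matthews_corrcoef

def pick_threshold(val_scores, val_labels, candidates=np.arange(0.05, 0.95, 0.05)):
    mccs = [matthews_corrcoef(val_labels, val_scores >= t) for t in candidates]
    return candidates[int(np.argmax(mccs))]

# val_scores and val_labels are hypothetical NumPy arrays of per-LN prediction
# scores (floats in [0, 1]) and binary ground-truth labels.
```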
Comparisons with other weakly supervised methods
The results indicated that the proposed ESCNN accurately identified gastric LN metastasis under weak supervision. We next investigated the performance of alternative weakly supervised methods, most of which adopt a two-stage architecture: patch feature extraction followed by interpatch aggregation. MIL typically uses a first-stage CNN to make patch predictions and then selects the most suspicious patch, the one with the highest prediction score (i.e., max pooling), to represent the entire slide. In the MIL–recurrent neural network (MIL-RNN) architecture17, an RNN aggregates the embeddings of the top-scoring patches instead of max pooling. In the clustering-constrained attention MIL (CLAM)18 approach, a pretrained CNN extracts patch embeddings in the first stage, and the second stage applies a clustering-constrained attention module. Trained on the main training set, MIL, MIL-RNN, and CLAM yielded AUCs of 0.9449 (0.9265–0.9634), 0.9475 (0.9297–0.9653), and 0.9323 (0.9120–0.9527) for LN image classification, respectively, and AUCs of 0.9687 (0.9462–0.9912), 0.9704 (0.9493–0.9914), and 0.9649 (0.9393–0.9905) for WSI classification, respectively. As shown in Fig. 2b, the proposed ESCNN model (AUC for LN image classification: 0.9831) empirically outperformed these alternatives (P < .001).
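As a point of reference for the MIL baseline described above, the following PyTorch sketch implements patch scoring with max pooling; the backbone, patch size, and tiling are assumptions rather than the compared methods' exact configurations.

```python
# Minimal sketch of max-pooling MIL: a patch-level CNN scores every tile of an
# LN image, and the highest-scoring tile represents the whole image.
import torch
import torchvision

class MaxPoolingMIL(torch.nn.Module):
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)       # assumed backbone
        backbone.fc = torch.nn.Linear(backbone.fc.in_features, 1)  # one logit per patch
        self.patch_model = backbone

    def forward(self, patches):                  # patches: (num_patches, 3, 224, 224)
        patch_logits = self.patch_model(patches).squeeze(-1)
        return patch_logits.max()                # max pooling over patch scores

model = MaxPoolingMIL()
dummy_patches = torch.randn(8, 3, 224, 224)      # hypothetical tiles from one LN image
image_logit = model(dummy_patches)
```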
Impact of image resolution, data set size, and label type on ESCNN performance
Aside from two-stage weak supervision, end-to-end training methods such as streaming CNN24 and the whole-slide training method22 also demonstrate favorable classification performance on tasks performed under low magnification. However, the image resolution these methods can handle is constrained by their prohibitively low throughput and high memory consumption. Because all these end-to-end methods are logically equivalent, we used the ESCNN to evaluate their performance on downsampled WSIs (5× and 10× magnification). The AUCs of ESCNN models trained on LN images at 5× magnification (1.84 µm/pixel) and at 10× magnification (0.92 µm/pixel) were 0.9580 and 0.9790, respectively, lower than the AUC of 0.9831 obtained at 20× magnification (P = .001 and .35). Further analysis revealed that the ability to identify micrometastases saturated at 10× magnification on the micrometastasis subset (AUCs at 5×, 10×, and 20×: 0.9748 vs. 0.9938 [P < .001] vs. 0.9936 [P = .35]). By contrast, the ability to identify ITCs on the ITC subset improved continually with increasing image resolution (0.8103 vs. 0.8861 [P = .044] vs. 0.9228 [P = .13]). The benefit conferred by 10× magnification was significant, whereas the additional benefit of 20× magnification requires verification with more ITC samples (Fig. 2c).
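The lower-magnification inputs for these experiments can be emulated by downscaling the 20× LN images, as in the following Pillow sketch; the file name and resampling filter are illustrative assumptions.

```python
# Sketch of the downsampling used for the magnification experiments: a 20x LN
# image (0.46 µm/pixel) is resized by 1/2 and 1/4 to emulate 10x and 5x inputs.
from PIL import Image

Image.MAX_IMAGE_PIXELS = None                    # allow gigapixel-scale images

def downsample(path_20x, factor):
    img = Image.open(path_20x)
    w, h = img.size
    return img.resize((w // factor, h // factor), resample=Image.Resampling.BILINEAR)

# ln_10x = downsample("ln_20x.png", 2)   # 0.92 µm/pixel (hypothetical file name)
# ln_5x  = downsample("ln_20x.png", 4)   # 1.84 µm/pixel
```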
We also evaluated the impact of data set size and label type (slide level or LN level) on identification performance. Under training with LN-level labels and 10× images, performance improved with larger data sets (AUCs of LN image classification: 0.9343, 0.9493 [P = .032], and 0.9790 [P < .001] for the truncated 869-LN-image, truncated 1918-LN-image, and full 5907-LN-image data sets), suggesting that more training data improves model performance. Under training with only slide-level labels, the AUCs of LN image classification obtained using 983 and 1700 WSIs were 0.9194 and 0.9606, respectively. Notably, regardless of the label type, model performance corresponded relatively well to the number of labels: the models trained using 869 LN images and 983 WSIs achieved comparable results (LN-level AUC: 0.9343 vs. 0.9194, P = .13), as did the models trained using 1918 LN images and 1700 WSIs (LN-level AUC: 0.9493 vs. 0.9606, P = .11). In short, when the total number of slides is limited, LN-level labels are recommended for enhancing model performance (Fig. 2d).
Throughput and memory consumption
Despite the logical equivalence of these end-to-end training methods, their vast computational and memory overhead precludes the handling of high-resolution tasks. As presented in Fig. 3, we examined the throughput and memory consumption of these approaches under various input image resolutions (4688 × 4688 [1.25×], 9375 × 9375 [2.5×], 18,750 × 18,750 [5×], 37,500 × 37,500 [10×], and 75,000 × 75,000 [20×]) by using 100 randomly sampled LN images from the main training set. Among these resolutions, an original ResNet5026 model can undergo direct end-to-end training only on 1.25× images (memory consumption: 19.1 GB) because of the limited GPU memory capacity (NVIDIA Tesla V100 with 32 GB of random-access memory [RAM]). The whole-slide training method22 leverages CUDA Unified Memory so that the excessive amount of intermediate data stored in GPU memory can be offloaded to host memory through data swapping. Although host memory is 10×–100× larger than GPU memory on a typical GPU server, this method can at best train a 5× model (memory consumption: 618.9 GB) on a server with 768 GB of system memory. Moreover, the throughput was considerably hindered (0.153 images per minute for training on 5× images) by the overhead of GPU–host memory data transfer. By contrast, the streaming CNN and ESCNN methods reduce the amount of intermediate data, such that the memory consumption for model training remained between 8 and 9 GB regardless of the image resolution. This ensured that all the intermediate data fit into GPU memory, obviating the need for Unified Memory. Without the data-swapping overhead, the training throughputs of streaming CNN and ESCNN for 5× model training were 1.49 and 3.48 images per minute, 9.74× and 22.7× faster than the whole-slide training method, respectively. When trained on 20× images, the ESCNN approach achieved a training throughput of 0.912 images per minute, 9.83× faster than the 0.0928 images per minute achieved by the streaming CNN method. This improvement is attributable to the patch-based image augmentation (2.31× speedup) and skipping mechanism (4.26× speedup) of the ESCNN approach.
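Per-image throughput and peak GPU memory of the kind reported in Fig. 3 can be measured with a harness like the sketch below; train_step is a hypothetical single-image training function, and the reported numbers come from the authors' benchmark, not this snippet.

```python
# Sketch of a benchmarking harness: time one training pass per LN image and
# record the peak GPU memory allocated during that pass.
import time
import torch

def benchmark(train_step, images, labels):
    times, mems = [], []
    for img, lab in zip(images, labels):
        torch.cuda.reset_peak_memory_stats()
        start = time.perf_counter()
        train_step(img, lab)                      # one forward/backward/update pass
        torch.cuda.synchronize()                  # wait for GPU work before timing
        times.append(time.perf_counter() - start)
        mems.append(torch.cuda.max_memory_allocated() / 2**30)   # GB
    throughput = 60.0 / (sum(times) / len(times))                # images per minute
    return throughput, max(mems)
```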

Each panel represents the (a) training throughput, (b) inference throughput, (c) training memory consumption (referring to Unified Memory for the whole-slide training method and GPU memory for the others), and (d) inference memory consumption. For each setting, we recorded the training/inference time and memory consumption when processing each LN image (n = 100 images in total, sampled from the main training set). Each box-and-whisker plot comprises the center (median), the bounds of boxes (Q1 and Q3), the bounds of whiskers (the minimum and maximum within the range, obtained by adding the median to ±1.5 times the Q3–Q1 distance), and the outliers of the underlying 100 samples. The absence of certain boxes indicates that those settings could not be run due to memory shortages.
Lesion highlights and qualitative analysis
The model highlighted metastatic tumor areas for rapid verification through class activation mapping (CAM)28. In quantitative analysis, the saliency maps generated by the algorithm achieved an Intersection over Union (IoU) of 0.5934 (at a threshold of 0.5) and a pixel-level AUC of 0.8495 on five WSIs with detailed annotations sampled from the main test set, demonstrating high correspondence between the predicted and actual lesion areas.
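The saliency-map metrics reported here (IoU at a 0.5 threshold and pixel-level AUC) can be computed as in the following sketch, assuming the CAM heat map and the lesion annotation are same-sized arrays.

```python
# Sketch of the saliency-map evaluation: IoU at a fixed threshold and
# pixel-level AUC between the CAM heat map and the annotated lesion mask.
import numpy as np
from sklearn.metrics import roc_auc_score

def saliency_metrics(heatmap, lesion_mask, threshold=0.5):
    pred = heatmap >= threshold                  # binarized prediction
    gt = lesion_mask.astype(bool)                # pixel-level ground truth
    iou = (pred & gt).sum() / max((pred | gt).sum(), 1)
    pixel_auc = roc_auc_score(gt.ravel(), heatmap.ravel())
    return iou, pixel_auc

# heatmap and lesion_mask are hypothetical same-sized 2D arrays in [0, 1] and {0, 1}.
```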
As displayed in Fig. 4, the CAM results of our model exhibited higher coverage of macrometastases and micrometastases than the other methods did and could localize ITCs. Furthermore, CAM was employed to investigate the sources of our model's false predictions (Fig. 5). Specifically, 24 false-positive slides were reviewed. Slides with artifacts (including cautery, crushing, and floater artifacts; 5, 21%) and histiocytic aggregates (3, 13%) may have misled the model; no common patterns were found in the remaining 16 false-positive slides. Among the 13 reviewed false-negative slides, the metastatic foci were mostly ITCs (11, 85%), with the remainder being micrometastases (2, 15%). Morphologically, most cases were diffuse- or mixed-type adenocarcinoma (10, 77%), characterized by small numbers of dispersed ITCs that may resemble sinus histiocytes. The remaining cases (3, 23%) were intestinal-type adenocarcinoma, in which ITCs with clear cytoplasm were observed.

Each panel displays an example of an H&E-stained LN image, the reference standard of the metastatic area under IHC staining (cytokeratin AE1/AE3), the heat map for lesion localization generated by our model through CAM, the heat map generated by a 5× low-resolution ESCNN model, the attention map of a CLAM model, and the prediction map of a MIL model. Identified metastatic tumor cells are highlighted in brown in the IHC stains and in red in the heat maps. Examples of (a) macrometastasis, (b) micrometastasis, and (c, d) ITCs from the main test set demonstrate the high correspondence between the model-predicted area and the IHC-confirmed area; (d) displays the high-power field of the green boxes in (c). Beyond the displayed examples, the localization performance of our model was consistent across the 263 LN images from the main test set that were correctly classified as metastatic.

Each panel displays an example of an H&E-stained LN image (left), the reference standard of the metastatic area under IHC staining (cytokeratin AE1/AE3; middle), and the heat map for lesion localization generated by our model through CAM (right). a Example of a false-negative case showing histiocyte-like metastasis, which tended to mislead our model; the main test set contained 10 similar samples. b Histiocyte-like diffuse-type ITCs in sinusoids were challenging for both the model and the pathologists; their accurate detection may require IHC slides. c Slide showing histiocytic aggregates in sinusoids and unusual blue proteinaceous fluid, which caused our model to issue a false alarm and led five of the six pathologists to incorrectly interpret the slide as positive. Incorrect highlighting of histiocytic aggregates appeared in three samples. d Floater misidentified as metastatic adenocarcinoma by our model and by five of the six pathologists under AI assistance. Five slides with artifacts (including cautery, crushing, and floater artifacts) were misidentified.
Comparisons with pathologists and a pilot study of AI assistance
The main test set was reviewed by four pathologists, and each pathologist's classification of each LN image was examined. The MCCs of the four pathologists (0.9497–0.9818) exceeded that of the model (0.8986). Notably, for the 48 LN images on which the four pathologists did not reach a consensus (sensitivity: 39.4–81.8%; specificity: 26.7–86.7%), the model exhibited a relatively high sensitivity of 69.7% and specificity of 86.7%, suggesting that AI assistance could be helpful in such equivocal situations. To confirm this premise, three pathologists (J.L., H.-C.C., and T.-Y.H.) were asked to re-review the 48 equivocal LN images by using the AI-assisted LN assessment workflow. Overall, 42.4% of the previous labels were changed, and the performance of all three pathologists improved (MCCs without assistance: 0.9497–0.9589; MCCs with assistance: 0.9795–0.9863) to the level of the expert pathologist (S.-C.H.; 0.9818). We therefore conducted a formal study to validate the clinical impact of the AI-assisted workflow in terms of review time, accuracy, and count consistency.
Clinical impact of the AI-assisted LN assessment workflow
As mentioned, the assessment workflow included an LN detector. Trained using the main training set of 5907 LN images, the DeepLabv3+-based29 LN detector achieved an IoU of 0.8473 and a pixel-wise accuracy of 92.83%. Six pathologists (J.L., T.-Y.H., H.-C.C., K.-H.C., R.-C.W., and Y.-J.L.) were recruited to review 80 slides with and without AI assistance, with a 2-to-3-week washout interval. The slides, sampled from the archive of Linkou CGMH in 2020, comprised 19 negative slides, 24 slides with macrometastasis (≥2 mm), 24 slides with micrometastasis (<2 mm, ≥0.2 mm), and 13 slides with ITCs (<0.2 mm).
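For context, a DeepLabv3+-style binary LN segmenter can be instantiated as sketched below using the segmentation_models_pytorch library; the encoder, input size, and weights are illustrative assumptions, as the text does not specify the detector's exact configuration.

```python
# Sketch of a DeepLabv3+-style LN detector (binary LN vs. background mask).
import torch
import segmentation_models_pytorch as smp

detector = smp.DeepLabV3Plus(
    encoder_name="resnet50",     # assumed backbone, not confirmed by the text
    encoder_weights=None,        # weights omitted here to keep the sketch self-contained
    in_channels=3,
    classes=1,                   # one channel: LN vs. background
)
logits = detector(torch.randn(1, 3, 512, 512))   # (1, 1, 512, 512) mask logits
```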
As indicated in Fig. 6a, the workflow significantly shortened the pathologists' review time (per-slide median: 161.2 to 110.5 s, −31.5%, P < .001). The review time also decreased significantly for negative slides (178.9 to 127.7 s, −28.6%, P < .001), slides exhibiting macrometastasis (166.7 to 112.2 s, −32.7%, P < .001), slides exhibiting micrometastasis (142.2 to 99.4 s, −30.1%, P < .001), and slides exhibiting ITCs (139.7 to 103.4 s, −26.0%, P = .005). AI assistance accelerated the review of most cases; in a few negative cases, however, the review time increased slightly. As presented in Fig. 6b, mixed-effect modeling revealed that AI-attributable false alarms increased the time taken to review negative slides (median time for slides with and without false alarms: 146.3 vs. 119.1 s, respectively; P = .047). In short, AI-predicted false positives prompted the pathologists to scrutinize the flagged slides more thoroughly, increasing the review time, although it remained shorter than the review time without AI assistance.

a Per-slide review time with and without AI assistance. Macro and Micro are abbreviations of macrometastasis and micrometastasis, respectively. b Impact of AI-attributable false alarms on review time, assessed by comparing the time taken to review negative slides with and without false alarms. c Accuracies (i.e., specificities for negative slides and sensitivities for positive slides) achieved with and without AI assistance. d Impact of AI-attributable false alarms on specificity. e CVs per slide, calculated using the positive LN classifications of the six pathologists to quantify the interrater reliability of positive LN counts. f CVs per slide of negative LN classifications. The box-and-whisker plots in (a), (b), and (e, f) comprise the center (median), the bounds of boxes (Q1 and Q3), the bounds of whiskers (the minimum and maximum within the range obtained by adding the median to ±1.5 times the Q3–Q1 distance), and the outliers. The numbers within the boxes are the medians. The centers and error bars in (c, d) represent the sensitivities (or specificities) and the 95% confidence intervals, respectively.
Regarding the accuracy of reported positive slides (a slide was classified as positive when at least one positive LN was detected), AI assistance significantly increased the slide-level sensitivity from 81.94% (79.25–92.18%) to 95.83% (91.15–98.46%, P < .001) for slides exhibiting micrometastases and from 67.95% (56.42–78.07%) to 96.15% (89.17–99.20%, P < .001) for slides exhibiting ITCs. The sensitivity for slides exhibiting macrometastasis remained at the same high level without (99.31% [96.19–99.98%]) and with (100.0% [97.47–100.0%], P > .99) AI assistance (Fig. 6c). As displayed in Fig. 6d, false alarms in negative slides sometimes led the pathologists to report false-positive results, causing the specificity to drop from 93.86% (87.76–97.50%) to 84.21% (76.20–90.37%, P = .019). All but one of the false alarms (16/17) were concentrated in 3 of the 19 negative slides, each of which misled five or six of the pathologists. The class activation maps of these slides highlighted regions containing tightly aggregated histiocytes with unusual blue proteinaceous fluid, increased numbers of high endothelial venules, and unintentionally introduced floater artifacts, respectively (Fig. 5).
The counts of positive LNs differed among the pathologists, which is ascribable to inconsistent diagnoses of LN metastasis and to subjective distinctions between LN and non-LN tissue. Under AI assistance, the consistency of positive reports, quantified by the coefficient of variation (CV; lower is better), improved significantly (median: 0.3499 to 0, P < .001) in all the positive categories, namely macrometastasis (0.1775 to 0, P < .001), micrometastasis (0.3651 to 0, P < .001), and ITCs (0.6388 to 0.1113, P = .014; Fig. 6e). The consistency of negative reports also improved, but less markedly (Fig. 6f).
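The per-slide CV used here can be computed as in the brief sketch below; the counts are hypothetical, and whether a population or sample standard deviation was used is not specified in the text.

```python
# Sketch of the per-slide coefficient of variation (CV) quantifying interrater
# consistency of positive LN counts across the six pathologists.
import numpy as np

def coefficient_of_variation(counts):
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean()
    # Population standard deviation assumed here; ddof=1 would give the sample SD.
    return counts.std(ddof=0) / mean if mean > 0 else 0.0

# Hypothetical positive-LN counts reported by six pathologists for one slide.
print(coefficient_of_variation([3, 3, 4, 3, 2, 3]))
```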
Cross-site evaluation
Having assessed both the performance of the ESCNN model and the clinical workflow using slides from Linkou CGMH, we validated the robustness of the workflow on an external site. Specifically, we applied the 20× ESCNN model to the 327 slides collected from Kaohsiung CGMH between 2019 and 2021, with the 2088 LN images annotated by S.-C.H., J.L., H.-C.C., T.-Y.H., and K.-H.C. Regarding the cross-site performance of metastasis identification, AUCs of 0.9868 (0.9784–0.9952) and 0.9829 (0.9652–1.0) were achieved for the classification of LN images and WSIs, respectively; these values did not differ significantly from those of the main test set (0.9831 [P = .59] and 0.9936 [P = .29], respectively). The LN detector's IoU of 0.9044 (vs. 0.8522 on the main test set) also indicated high model generalizability.