Evaluating AI-Driven Segmentation for TNM Staging in NSCLC Using [18F]FDG PET/CT

In a retrospective single‑center cohort of 306 treatment‑naïve NSCLC patients, the top-performing autoPET III algorithm was evaluated for automated tumor and nodal contouring on diagnostic [18F]FDG PET/CT and for the effect of those contours on TNM and UICC staging. The team assessed both voxel-level overlap and lesion-level detection metrics. The model achieved a mean Dice Similarity Coefficient (DSC) of 0.64 and a lesion-level sensitivity of 95.8%, but overall UICC stage concordance was only 67.7% when AI-derived masks were carried into clinical staging. The analysis paired standard overlap measures with a task-aware error taxonomy that classifies errors by their staging consequence, showing that excellent raw detection does not guarantee staging concordance.
A DSC of 0.64 denotes moderate voxel overlap and reflects meaningful volumetric differences from manual contours. Only about two‑thirds (67.7%) of UICC stage assignments matched the reference, so voxel-level agreement did not reliably translate into correct patient-level staging. On this evidence, AI segmentation cannot yet substitute for expert readers when autonomous staging decisions are required.
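
To make the overlap measure concrete, the following is a minimal sketch of how a Dice Similarity Coefficient can be computed for a pair of binary masks, using the standard definition DSC = 2|A ∩ B| / (|A| + |B|). The function name `dice_coefficient`, the toy masks, and the convention of returning 1.0 when both masks are empty are illustrative assumptions; the study's actual evaluation code is not described in the text.

```python
import numpy as np


def dice_coefficient(pred, ref):
    """Dice Similarity Coefficient between two binary masks of equal shape.

    Hypothetical helper for illustration; not the study's implementation.
    """
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    if pred.shape != ref.shape:
        raise ValueError("masks must have the same shape")
    total = pred.sum() + ref.sum()
    if total == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    overlap = np.logical_and(pred, ref).sum()
    return float(2.0 * overlap / total)


# Toy 2D masks (a real PET/CT mask would be a 3D voxel array).
pred_mask = np.array([[1, 1, 0],
                      [0, 1, 0]])
ref_mask = np.array([[1, 0, 0],
                     [0, 1, 1]])

# Two voxels overlap; each mask contains three voxels: 2*2 / (3+3).
toy_dsc = dice_coefficient(pred_mask, ref_mask)
print(f"DSC = {toy_dsc:.3f}")
```

In this toy case a third of each mask lies outside the other, yet the score is still about 0.67, which illustrates how a value near 0.64 can coexist with substantial boundary and volume disagreement.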
The model’s high sensitivity came with a notable false‑positive burden, particularly for extrathoracic findings: precision for M-category lesions fell to about 74%, with 196 false‑positive distant lesions in the cohort. Because a single distant lesion is enough to assign M1 disease, these false positives inflate metastatic calls and frequently drive upstaging that would prompt systemic or palliative treatment changes, materially altering downstream management.
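
The relationship between the precision figure and the false-positive count can be sketched with the usual definition precision = TP / (TP + FP). The snippet below is a back-of-envelope illustration only: the helper names are hypothetical, and the true-positive count it derives is implied by the rounded figures quoted above (about 74% precision, 196 false positives), not a number reported in the text.

```python
def precision(true_positives, false_positives):
    """Lesion-level precision: fraction of predicted lesions that are real."""
    predicted = true_positives + false_positives
    if predicted == 0:
        raise ValueError("no predicted lesions")
    return true_positives / predicted


def implied_true_positives(false_positives, precision_value):
    """Solve precision = TP / (TP + FP) for TP.

    Hypothetical helper; the result is an estimate, not a reported count.
    """
    if not 0.0 <= precision_value < 1.0:
        raise ValueError("precision must be in [0, 1)")
    return precision_value * false_positives / (1.0 - precision_value)


# Figures quoted in the text: 196 false-positive distant lesions at ~74% precision.
implied_tp = implied_true_positives(196, 0.74)
print(f"implied true-positive distant lesions: about {implied_tp:.0f}")

# Sanity check: plugging the rounded estimate back in recovers ~0.74.
print(f"precision check: {precision(round(implied_tp), 196):.3f}")
```

Under these assumptions, roughly one in four AI-flagged distant lesions would be spurious, which is why high sensitivity alone does not protect against systematic upstaging of the M category.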