Enhancing Radiology Workflow with AI: An Exploration of LLMs in Triage Systems

In a study evaluating large language model (LLM) triage of high-priority radiology reports, the Llama3 Elyza 8B model achieved a PRAUC of 0.962 and a ROCAUC of 0.968 on a balanced test set, a level of discrimination that could support faster identification and routing of unexpected findings.
The evaluation used a balanced dataset (1,906 reports for training and 176 for testing) to fine-tune several 8B-class LLM variants to predict whether a report qualified as high-priority based on unexpected or urgent findings. Performance was strong (PRAUC 0.962, ROCAUC 0.968, accuracy 0.915, sensitivity 0.932, specificity 0.898). A sensitivity of 0.932 means that roughly 6.8% of high-priority reports were missed, and a specificity of 0.898 means that roughly 10.2% of routine reports were flagged in error, a false-alarm rate considered operationally acceptable.
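To make the relationship between these figures concrete, the following minimal Python sketch recomputes accuracy, sensitivity, and specificity from confusion-matrix counts. The counts are not reported in the text above. They are inferred under the assumption that the 176 test reports split exactly 88/88 between high-priority and routine, in which case 82 true positives and 79 true negatives reproduce the rounded metrics. The names ConfusionCounts and triage_metrics are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # high-priority reports correctly flagged
    fn: int  # high-priority reports missed
    tn: int  # routine reports correctly passed
    fp: int  # routine reports flagged in error (false alarms)


def triage_metrics(c: ConfusionCounts) -> dict:
    """Compute standard binary-classification metrics from raw counts."""
    positives = c.tp + c.fn
    negatives = c.tn + c.fp
    return {
        "accuracy": (c.tp + c.tn) / (positives + negatives),
        "sensitivity": c.tp / positives,       # recall on high-priority reports
        "specificity": c.tn / negatives,
        "miss_rate": c.fn / positives,         # 1 - sensitivity
        "false_alarm_rate": c.fp / negatives,  # 1 - specificity
    }


# ASSUMPTION: an exact 88/88 class split of the 176 test reports.
# These counts are inferred for illustration, not taken from the study.
counts = ConfusionCounts(tp=82, fn=6, tn=79, fp=9)
metrics = triage_metrics(counts)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under that assumed split, the reported figures correspond to about 6 missed high-priority reports and 9 false alarms out of 176 test reports.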
Input-scope experiments showed that giving the model the report findings plus the referring department produced the best discrimination. Adding the pre-exam clinical diagnosis text or extra request details did not reliably improve performance.
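One way to picture an input-scope experiment is as a choice of which report fields are concatenated into the text the model sees. The sketch below is a hypothetical illustration only: the field names, scope labels, and the "field: value" layout are assumptions, since the actual prompt format used in the study is not described here. The report contents are placeholders, not real data.

```python
from typing import Mapping

# ASSUMPTION: hypothetical scope definitions; the study's real field names
# and prompt layout are not given in the text.
SCOPES = {
    "findings_only": ("findings",),
    "findings_plus_department": ("findings", "referring_department"),
    "findings_department_diagnosis": (
        "findings", "referring_department", "clinical_diagnosis",
    ),
    "all_fields": (
        "findings", "referring_department", "clinical_diagnosis",
        "request_details",
    ),
}


def build_model_input(report: Mapping[str, str], scope: str) -> str:
    """Join the fields selected by `scope` into one text block.

    Fields that are missing or empty in the report are skipped.
    """
    parts = []
    for field in SCOPES[scope]:
        value = report.get(field, "").strip()
        if value:
            parts.append(f"{field}: {value}")
    return "\n".join(parts)


# Placeholder report; values are stand-ins, not clinical content.
example_report = {
    "findings": "Example findings text.",
    "referring_department": "Example department",
    "clinical_diagnosis": "Example pre-exam diagnosis",
    "request_details": "Example request details",
}

text = build_model_input(example_report, "findings_plus_department")
print(text)
```

Keeping the scopes in a single table makes it straightforward to train and evaluate one model per scope and compare their discrimination on the same test set.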