LoRA-LLaMA-3 Classifies ASA Physical Status From Perioperative Data

Key Takeaways
- LoRA-LLaMA-3 reached a micro-F1 of 0.780 on hold-out testing, while XGBoost posted the strongest overall results on the main test set.
- Minority ASA classes were harder to classify, with lower macro-F1 and confusion concentrated outside the dominant ASA II and ASA III groups.
- Model rationales were usually coherent, although hallucinations were not fully eliminated, and the authors outlined a read-only preoperative-clinic workflow as one possible deployment approach.

The system was trained on reformatted clinical narratives derived from the structured and unstructured perioperative records of 24,491 surgical patients. The comparison focused on ASA-PS assignment from existing clinical records rather than prospective workflow use. XGBoost delivered the strongest overall results in the main model comparison.
This retrospective single-center study was conducted at Far Eastern Memorial Hospital using records collected from November 21, 2015, to August 1, 2023. The cohort included adults undergoing surgery with general or neuraxial anesthesia, with ASA I through V retained and ASA VI excluded. The final analytic sample comprised 24,491 patients, split by stratified sampling into 17,143 training cases, 2,449 validation cases, and 4,899 test cases. Inputs combined preoperative anesthesia notes, discharge summaries, and structured perioperative variables reformatted into Alpaca-style instruction-response prompts, and all comparisons used the same fixed hold-out test set.
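The Alpaca-style reformatting can be illustrated with a short sketch. The field names (`age`, `comorbidities`, `preop_note`, `asa_label`) and the instruction wording below are illustrative assumptions, not the paper's actual template:

```python
def to_alpaca_prompt(record: dict) -> str:
    """Reformat one perioperative record into an Alpaca-style
    instruction/input/response training example (hypothetical template)."""
    instruction = (
        "Assign an ASA Physical Status class (I-V) based on the "
        "preoperative information below, and briefly justify the assignment."
    )
    # Structured variables and narrative text are flattened into one input block.
    input_block = (
        f"Age: {record['age']}\n"
        f"Comorbidities: {', '.join(record['comorbidities']) or 'none'}\n"
        f"Preoperative note: {record['preop_note']}"
    )
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{input_block}\n\n"
        f"### Response:\n{record['asa_label']}"
    )

example = {
    "age": 67,
    "comorbidities": ["diabetes", "cirrhosis"],
    "preop_note": "Elective cholecystectomy; stable on metformin.",
    "asa_label": "ASA III",
}
prompt = to_alpaca_prompt(example)
```

At training time, the model would see everything up to `### Response:` as context and learn to generate the label (and, here, a rationale) as the completion.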
In comparisons with random forest, XGBoost, support vector machine, fastText, BioBERT, ClinicalBERT, and untuned LLaMA-3, the LoRA model achieved a micro-F1 of 0.780, MCC of 0.533, AUROC of 0.863, and AUPRC of 0.653 on the main test set. XGBoost led the comparison with a micro-F1 of 0.815, MCC of 0.613, AUROC of 0.884, and AUPRC of 0.701. The cohort was heavily weighted toward ASA II with 15,272 cases and ASA III with 8,024 cases, while ASA I, IV, and V included only 535, 606, and 54 cases, respectively. LoRA-LLaMA-3 had a macro-F1 of just 0.316, and neither generative oversampling to 3,000 cases per class nor inverse-frequency reweighting improved overall performance.
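Inverse-frequency reweighting, one of the imbalance remedies the authors tried, can be sketched directly from the reported class counts. The weighting formula below, total / (n_classes × count), is a common convention and an assumption about their exact implementation:

```python
# Reported class counts from the cohort (ASA I-V).
counts = {"I": 535, "II": 15272, "III": 8024, "IV": 606, "V": 54}
total = sum(counts.values())  # 24,491 patients

# Inverse-frequency weights: rare classes get proportionally larger
# loss weights, and common classes are down-weighted.
weights = {c: total / (len(counts) * n) for c, n in counts.items()}

# Under this scheme ASA V is weighted roughly 283x more heavily than ASA II,
# which hints at why reweighting alone can destabilize rather than fix training.
ratio = weights["V"] / weights["II"]
```

With only 54 ASA V examples, even an aggressively upweighted loss has very little signal to learn from, which is consistent with the reported lack of improvement.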
For explainability, blinded review of 40 held-out cases found no hallucination in 86.25% of rationale ratings and clinical coherence in 91.25%. The sample was stratified across ASA I through IV, and independent board-certified anesthesiologists performed the ratings. Because hallucinations were not completely eliminated, the authors cautioned that explanations should still be interpreted carefully. Attention analyses also highlighted age and comorbidity terms such as diabetes, cirrhosis, and stroke, although these patterns were treated as exploratory. In the subset of 694 cases with 512 tokens or fewer, LoRA-LLaMA-3 reached a micro-F1 of 0.879 but an MCC of only 0.297. Ablation experiments found that a dropout of 0.4 yielded the best performance, although dropout 0.3 was retained for subsequent experiments; other retained settings included a learning rate of 3×10⁻⁵, temperature of 0.1, and top-p of 0.1.
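The retained settings can be grouped into a single configuration sketch. The key names below are illustrative groupings, not drawn from the paper's code:

```python
# Retained fine-tuning settings reported in the ablation (key names illustrative).
finetune_config = {
    "learning_rate": 3e-5,  # retained learning rate
    "lora_dropout": 0.3,    # retained, although 0.4 scored best in ablation
}

# Retained decoding settings: low temperature and a tight nucleus (top-p)
# keep the generated ASA class and rationale near-deterministic.
generation_config = {
    "temperature": 0.1,
    "top_p": 0.1,
}
```

Near-deterministic decoding is a sensible choice for a classification task, where run-to-run variation in the predicted class would undermine trust in the tool.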
The authors described a read-only preoperative-clinic workflow that would display a predicted ASA-PS class, provide a rationale, allow one-click confirmation, and capture feedback. Reported computational latency averaged 2.25 seconds per case. Limitations included the retrospective single-center design, subjective ASA labels, marked imbalance in ASA I, IV, and V, normal-value imputation for missing laboratories, residual hallucination risk, and lack of real-time usability evaluation despite reported computational efficiency. The work provides an internally validated framework for adjunctive ASA-PS classification rather than evidence of replacing clinician judgment or establishing external generalizability.