MARCH: Multi-Agent Hierarchy for CT Report Generation

Key Takeaways
- MARCH was reported to outperform baseline report-generation systems across the reported text-generation and clinical-efficacy metrics.
- Stronger clinical efficacy was observed across most predefined findings, with clearer gains in minor abnormalities such as hiatal hernia and pericardial effusion.
- Removing consensus-driven finalization produced the largest decline in the ablation analysis; the authors also noted reliance on GPT-series backbones, the absence of long-term memory, and the lack of a human-in-the-loop interface.

The system assigns a Resident agent to draft reports, Fellow agents to revise them with retrieved evidence, and an Attending agent to finalize consensus. The hierarchy is designed to mirror radiology review workflows while improving grounding and completeness in generated reports. Performance exceeded baseline methods on the reported benchmark, with especially notable gains in minor findings.
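
As a rough illustration of this review hierarchy, the Python sketch below wires the three roles into a single pipeline. Everything here is hypothetical: the names (`Case`, `resident_draft`, `fellow_revise`, `attending_finalize`, `march_pipeline`), the prompt wording, and the use of plain strings in place of CT features and retrieved cases; the `llm` callable stands in for whatever model backend each role actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List

# Stand-in LLM interface: any callable mapping a prompt to generated text.
LLM = Callable[[str], str]

@dataclass
class Case:
    ct_features: str       # placeholder for extracted multi-scale CT features
    retrieved: List[str]   # placeholder for retrieved similar-case reports

def resident_draft(case: Case, llm: LLM) -> str:
    """Resident agent: draft an initial report from image-derived features."""
    return llm(f"Draft a chest CT report from these findings:\n{case.ct_features}")

def fellow_revise(draft: str, evidence: str, llm: LLM) -> str:
    """Fellow agent: revise the draft against one piece of retrieved evidence."""
    return llm(
        "Revise this draft so it stays consistent with the retrieved case.\n"
        f"Draft:\n{draft}\n\nRetrieved case:\n{evidence}"
    )

def attending_finalize(draft: str, revisions: List[str], llm: LLM) -> str:
    """Attending agent: merge the Fellow revisions into one consensus report."""
    joined = "\n---\n".join(revisions)
    return llm(f"Produce a consensus final report.\nDraft:\n{draft}\n\nRevisions:\n{joined}")

def march_pipeline(case: Case, llm: LLM) -> str:
    draft = resident_draft(case, llm)
    revisions = [fellow_revise(draft, ev, llm) for ev in case.retrieved]
    return attending_finalize(draft, revisions, llm)
```

Keeping `llm` as a plain callable makes the sketch backend-agnostic, which matters because, per the implementation notes further down, different roles are served by different models.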
Evaluation used the RadGenome-ChestCT benchmark, which includes 25,692 chest CT scans from 21,304 patients. The investigators followed the official split: 24,128 scans for training and 1,564 for testing. Report quality was measured with BLEU, ROUGE-L, and METEOR, alongside a Clinical Efficacy score based on 18 predefined abnormalities. Retrieval-augmented revision drew on image-to-image, image-to-text, and logits-based matching, while a classification head predicted the same 18 canonical findings, providing a consistent basis for comparison with baseline report-generation systems.
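
The section names the Clinical Efficacy score but not how it aggregates the 18 abnormalities; micro-averaged F1 over binary finding labels is a common choice for such scores, and the sketch below assumes exactly that. The `FINDINGS` list holds three illustrative names, not the benchmark's actual 18, and `clinical_efficacy_f1` is a hypothetical helper.

```python
from typing import Dict, List

# Illustrative finding names only; the benchmark defines 18 abnormalities.
FINDINGS = ["hiatal hernia", "pericardial effusion", "pleural effusion"]

def clinical_efficacy_f1(pred: List[Dict[str, int]],
                         ref: List[Dict[str, int]]) -> float:
    """Micro-averaged F1 over binary finding labels across all reports.

    pred/ref map each finding name to 0/1 per report. Treating Clinical
    Efficacy as micro-F1 over labeled findings is an assumption here.
    """
    tp = fp = fn = 0
    for p, r in zip(pred, ref):
        for f in FINDINGS:
            predicted, actual = bool(p.get(f, 0)), bool(r.get(f, 0))
            tp += predicted and actual        # bool arithmetic: True counts as 1
            fp += predicted and not actual
            fn += actual and not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

For instance, predicting only hiatal hernia against a reference that lists both hiatal hernia and pleural effusion gives precision 1.0, recall 0.5, and an F1 of about 0.67.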
Across the benchmark, MARCH outperformed baseline report-generation methods on the reported text-generation and clinical-efficacy measures. Clinical efficacy was stronger across most abnormalities, with the abnormality-level analysis using the Resident agent as a reference baseline. The clearest gains appeared in minor findings such as hiatal hernia and pericardial effusion. The authors framed these results as evidence that the multi-agent process may limit omissions and hallucinated content within the reported dataset.
Component testing showed the largest performance decline when consensus-driven finalization was removed from the hierarchy, suggesting it was the most influential single component. The Resident agent combined multi-scale CT feature extraction with multi-region segmentation, and each retrieval agent pulled the top three similar cases through image-to-image, image-to-text, or logits-based matching, giving the Fellow agents structured material for revision. The implementation used GPT-4.1 for the Fellow agents and GPT-4o for the Attending agent, with temperature set to zero; the Resident and retrieval agents were trained on a single H100 GPU.
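
A minimal sketch of the top-three retrieval step described above, assuming cosine similarity over precomputed embedding banks. The bank shapes and random data are placeholders; treating image-to-text matching as a lookup in a shared image-text embedding space is an assumption, as is fusing channels by simply keeping each channel's own top three.

```python
import numpy as np

def top_k(query: np.ndarray, bank: np.ndarray, k: int = 3) -> np.ndarray:
    """Indices of the k most cosine-similar rows in `bank`."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return np.argsort(b @ q)[::-1][:k]

# Placeholder banks of stored training cases (random data for illustration).
rng = np.random.default_rng(0)
image_bank = rng.normal(size=(100, 256))   # image embeddings (image-to-image)
text_bank = rng.normal(size=(100, 256))    # report embeddings (image-to-text,
                                           # assuming a shared embedding space)
logits_bank = rng.normal(size=(100, 18))   # 18-way finding logits (logits-based)

query_image = rng.normal(size=256)   # embedding of the incoming scan
query_logits = rng.normal(size=18)   # classification-head logits for the scan

# Each channel keeps its own top three neighbors; the Fellow agents would
# then receive the corresponding reports as structured revision evidence.
neighbors = {
    "image_to_image": top_k(query_image, image_bank),
    "image_to_text": top_k(query_image, text_bank),
    "logits_based": top_k(query_logits, logits_bank),
}
```

In practice the three channels would index the same training cases, so each retrieved index maps back to a stored report that a Fellow agent can cite during revision.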
The evaluation was limited to a single benchmark and did not include external validation or deployment evidence. The authors also noted dependence on GPT-series backbones for multi-agent reasoning, leaving generalizability to other medical language models unresolved. Additional limitations included the absence of long-term memory for longitudinal history or for learning from past diagnostic errors, and the lack of a human-in-the-loop interface. No adverse events or patient-level safety outcomes were described, and the reported gains remain confined to this dataset and evaluation design.