2025-08-08
Fine-tuned GPT-3.5-Turbo models are cost-effective experts that fast-track the analysis of chest CT reports, but they still need supervision.
Reviewing detailed chest CT reports is essential for planning a patient's surgery, but there are not enough radiology experts to do the work, and those available are overwhelmed by the workload (1: "Performance analysis of large language models in multi-disease detection from chest computed tomography reports: a comparative study: Experimental Research"). A study from Zhujiang Hospital of Southern Medical University examined 13,489 real-world chest CT reports and found that newer AI models may help by taking on some of this work, provided they are given the right instructions.
“We discovered that modern language models can act as a dependable second set of eyes for radiologists,” said Dr. Peng Luo, lead author and physician at Zhujiang Hospital. “With carefully worded multiple-choice prompts, GPT-4 reached a 75 percent accuracy rate across 13 common chest diseases, ranging from COPD to aortic atherosclerosis.”
Prompt Engineering: Top Performers Emerge
The study compared five AI models (GPT-4, Claude-3.5-Sonnet, Qwen-Max, Gemini-Pro, and GPT-3.5-Turbo) using both open-ended and multiple-choice questions. For all models, multiple-choice prompts improved accuracy and consistency, highlighting the power of prompt engineering. GPT-4, Claude-3.5-Sonnet, and Qwen-Max were the top performers, while GPT-3.5-Turbo and Gemini-Pro scored lower.
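To make the multiple-choice idea concrete, here is a minimal Python sketch of how such a prompt might be assembled and how a model's reply could be scored. The study's actual prompt wording is not given in this article, so the instruction text, the function names, and the "None of the above" option are all assumptions made for illustration; the disease names are the ones mentioned in this article.

```python
import re
import string


def build_multiple_choice_prompt(report, diseases):
    """Turn a report and a fixed disease list into a lettered multiple-choice prompt.

    The wording is hypothetical; it is not the prompt used in the study.
    """
    letters = string.ascii_uppercase
    options = "\n".join(f"{letters[i]}. {name}" for i, name in enumerate(diseases))
    none_letter = letters[len(diseases)]
    return (
        "Read the chest CT report and select every condition it supports.\n"
        "Answer with option letters only, separated by commas.\n\n"
        f"Report:\n{report}\n\n"
        f"Options:\n{options}\n{none_letter}. None of the above\n"
    )


def parse_choices(reply, diseases):
    """Map stand-alone option letters in a reply back to disease names."""
    letters = string.ascii_uppercase[: len(diseases)]
    picked = set(re.findall(r"\b([A-Z])\b", reply.upper()))
    return {diseases[letters.index(ch)] for ch in picked if ch in letters}


def exact_match_accuracy(predicted, truth):
    """Share of reports where the predicted disease set equals the reference set."""
    hits = sum(p == t for p, t in zip(predicted, truth))
    return hits / len(truth)


if __name__ == "__main__":
    diseases = ["COPD", "Pleural effusion", "Gallstones", "Aortic atherosclerosis"]
    # Placeholder report text, not taken from the study's data.
    print(build_multiple_choice_prompt("(report text goes here)", diseases))
    print(parse_choices("A, D", diseases))
```

Constraining the answer to a closed set of letters is what makes replies easy to parse and compare across models, which may be part of why this format scored more consistently than open-ended questions.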
To probe whether weaker models could catch up, the researchers fine-tuned GPT-3.5-Turbo on 200 high-performing cases. “Fine-tuning turned a 42 percent system into a 65 percent system overnight for tough pulmonary cases,” Dr. Luo said. “That's a game-changer for hospitals that rely on cost-effective models.”
Beyond raw accuracy, the study evaluated each model’s area under the ROC curve (AUC) for every disease. GPT-4 excelled at gallstone and pleural effusion detection, while Qwen-Max showed unusual strength in COPD discrimination. However, no single model dominated every condition, suggesting a tailored, disease-specific deployment strategy.
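As a rough illustration of the per-disease comparison described above, the sketch below computes AUC from labels and model scores, then picks the strongest model for each condition. It uses the rank-based definition of AUC (the chance that a randomly chosen positive case scores higher than a randomly chosen negative one). The function names, the model names `model_a` and `model_b`, and all numbers are hypothetical placeholders, not results from the study.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def best_model_per_disease(auc_by_model):
    """Given {model: {disease: auc}}, return {disease: model with the highest AUC}."""
    diseases = set()
    for aucs in auc_by_model.values():
        diseases.update(aucs)
    best = {}
    for disease in sorted(diseases):
        candidates = {m: a[disease] for m, a in auc_by_model.items() if disease in a}
        best[disease] = max(candidates, key=candidates.get)
    return best


if __name__ == "__main__":
    # Made-up AUC values for two hypothetical models, for illustration only.
    toy = {
        "model_a": {"COPD": 0.70, "pleural effusion": 0.90},
        "model_b": {"COPD": 0.80, "pleural effusion": 0.85},
    }
    print(best_model_per_disease(toy))
```

A lookup like this is one simple way to express a disease-specific deployment: each condition is routed to whichever model discriminates it best on held-out reports.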
The authors caution that LLM outputs still require expert oversight, especially when a model expresses high confidence in borderline cases. Future work will integrate explainable-AI tools to reveal how models weigh radiologic clues and to set dynamic confidence thresholds.
