Objectives: To explore the potential of Large Language Models (LLMs) to improve the structuring of data in clinical reports, with a particular focus on the identification and classification of personal history and comorbidities in the electronic medical records of oncology patients at the Hospital Virgen Macarena in Seville. We specifically evaluated the gpt-3.5-turbo-1106 and gpt-4-1106-preview models against specialized human evaluators.

Methods: We implemented a script using the OpenAI API to extract, in JSON format, structured information on the comorbidities reported in 250 personal history reports. The same reports were manually reviewed in groups of 50 by five radiation oncology specialists. A detailed analysis of the discrepancies between the GPT models and the physicians was used to establish the ground truth. We compared the results using sensitivity, specificity, precision, accuracy, F1 score, the kappa index, and McNemar's test, and we examined the most common causes of error in both the human reviewers and the GPT models.

Results: GPT-3.5 performed slightly worse than the physicians on all metrics, although the differences were not statistically significant. GPT-4 showed clear superiority on several key metrics; notably, it reached 96.8% sensitivity, compared with 88.2% for GPT-3.5 and 88.8% for the physicians. The physicians, however, marginally outperformed GPT-4 in accuracy (97.7% vs. 96.8%). GPT-4 also demonstrated greater consistency, replicating its exact results in 76% of the reports across 10 repeated analyses, versus 59% for GPT-3.5. Clinicians more often failed to detect explicitly stated comorbidities, possibly due to fatigue or distraction, whereas the GPT models more frequently inferred comorbidities that were not explicitly stated, sometimes correctly, but also producing more false positives.

Conclusion: With carefully designed prompts, the LLMs studied show competence comparable to that of medical specialists in the interpretation of clinical reports, even in complex and confusingly worded texts.
Considering also their superior time and cost efficiency, these models represent a preferable option to human analysis for data mining and structuring of information in large sets of clinical reports.
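As an illustration, the extraction step described in the Methods might look like the following sketch. The endpoint and request shape follow the public OpenAI chat completions API (JSON mode), but the prompt wording, function names, and the "comorbidities" output schema are assumptions for illustration, not the authors' actual script.

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"

# Illustrative prompt; the study's actual prompt wording is not published here.
SYSTEM_PROMPT = (
    "You are a clinical data abstractor. From the personal-history report, "
    "return JSON with a single key 'comorbidities': a list of objects with "
    "'name' and 'explicit' (true if stated verbatim, false if inferred)."
)

def build_payload(report_text: str, model: str = "gpt-4-1106-preview") -> dict:
    """Assemble the chat-completions request body for one report."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": report_text},
        ],
        # JSON mode forces a syntactically valid JSON reply
        "response_format": {"type": "json_object"},
        # temperature 0 favours reproducibility across repeated runs
        "temperature": 0,
    }

def parse_comorbidities(raw_json: str) -> list:
    """Parse the model's JSON reply, tolerating a missing key."""
    return json.loads(raw_json).get("comorbidities", [])

def extract_comorbidities(report_text: str, api_key: str,
                          model: str = "gpt-4-1106-preview") -> list:
    """One API call per report, returning the structured comorbidity list."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(report_text, model)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return parse_comorbidities(body["choices"][0]["message"]["content"])
```

Running such a script over all 250 reports yields one JSON comorbidity list per report, which can then be compared field by field against the specialists' manual review.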
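The evaluation metrics named in the Methods can all be computed from a 2x2 confusion matrix of detected vs. ground-truth comorbidities. The sketch below is a dependency-free illustration of those formulas (with an exact binomial McNemar test on the discordant counts b and c), not the authors' analysis code.

```python
from math import comb

def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    sens = tp / (tp + fn)            # sensitivity (recall)
    spec = tn / (tn + fp)            # specificity
    prec = tp / (tp + fp)            # precision (PPV)
    acc = (tp + tn) / total          # accuracy
    f1 = 2 * prec * sens / (prec + sens)
    # Cohen's kappa: observed agreement corrected for chance agreement
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    p_e = p_yes + p_no
    kappa = (acc - p_e) / (1 - p_e)
    return {"sensitivity": sens, "specificity": spec, "precision": prec,
            "accuracy": acc, "f1": f1, "kappa": kappa}

def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided McNemar test on the discordant pair counts b and c.

    Under the null, each discordant pair is a fair coin flip, so the
    smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Applied per comorbidity label across the 250 reports, this yields the paired model-vs-physician comparison summarized in the Results.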