Assessment of decision-making with locally run and web-based large language models versus human board recommendations in otorhinolaryngology, head and neck surgery

Previous monocentric studies evaluated the feasibility of LLMs in providing recommendations for tumor treatment regimens in ORL, head and neck surgery [16,17,18]. However, most of these studies examined older model versions and focused solely on ChatGPT, with one exception that also evaluated Claude 3.

Our study, however, is the first to evaluate both the web-based ChatGPT-4o and, in order to address data protection concerns, a locally run LLM (Llama 3). In contrast to prior studies, the recommendations of the LLMs were directly compared to human tumor board decisions and subsequently rated by experts. More specifically, four MDT members evaluated the LLMs' recommendations for medical adequacy on a 6-point Likert scale and assessed whether the provided information could have influenced the MDT's final decision.

Our findings reveal high concordance between the LLMs and the MDT, particularly in the critical distinction between curative and palliative therapy strategies. Llama 3, the locally run model, exhibited a 92% (23/25) concordance rate with the MDT, while ChatGPT-4o achieved 84% (21/25). ChatGPT-4o mentioned all first-line recommendations of the MDT in 64% (16/25) of the patients, Llama 3 in 60% (15/25). ChatGPT-4o stated the same first-line recommendation as the MDT as first-line in 52% (13/25) of the cases, Llama 3 in 48% (12/25); ChatGPT-4o stated the MDT's first-line recommendation in part as first-line in a further 12% (3/25) of cases, Llama 3 in a further 8% (2/25). Accordingly, ChatGPT-4o stated at least one identical MDT first-line therapy regimen as first-line in 72% (18/25) of the cases, Llama 3 in 68% (17/25). This competence of current LLMs, both online and offline, is surprisingly good, especially considering that previous studies in ORL were more negative regarding individualized therapy recommendations by LLMs [16, 17].
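The percentages above follow directly from the per-case counts out of the 25 constructed cases; a minimal sketch of the conversion, purely illustrative and using only the counts reported in the text:

```python
def rate(matches: int, total: int = 25) -> int:
    """Share of cases (in %) in which a model agreed with the MDT."""
    return round(100 * matches / total)

# Curative-vs-palliative concordance
assert rate(23) == 92  # Llama 3
assert rate(21) == 84  # ChatGPT-4o

# At least one identical first-line regimen stated as first-line
assert rate(18) == 72  # ChatGPT-4o
assert rate(17) == 68  # Llama 3
```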

First-line surgery as sole therapy was the most frequent suggestion of both the MDT and the LLMs: the MDT recommended it in 48% (12/25), ChatGPT-4o in 52% (13/25), and Llama 3 in 64% (16/25) of the cases. While the MDT provided a more detailed surgical therapy plan, e.g. “local resection and ipsilateral neck dissection”, the LLMs frequently lacked specification of the surgery. This is particularly problematic if the LLM, as Llama 3 did in an exemplary case (M1), concentrates on the resection of the primary tumor and does not mention a necessary neck dissection.

To provide a realistic setting, we intentionally built obstacles into the patient profiles, simulating the flawed and/or misleading submissions an MDT can receive. In one case, a patient with a cT3 cN3 cM0 hypopharyngeal carcinoma (P3) lacked any histopathological confirmation; both ChatGPT-4o and Llama 3 nevertheless suggested treatment plans, whereas the MDT first recommended a panendoscopy with subsequent biopsy. ChatGPT-4o suggested a palliative therapy for a tumor that was not histologically proven at that time. Llama 3 recommended primary chemoradiotherapy as first-line therapy although neither a histological evaluation nor a surgeon's statement regarding resectability was available.

A further pitfall for the LLMs was a cancer of unknown primary (CUP) syndrome (CUP1). Here, neither of the LLMs suggested a panendoscopy or a PET-CT as the MDT did. In addition, Llama 3 did not recognize that the primary tumor localisation was in fact unknown and that its resection was therefore not feasible.

Furthermore, in one instance (M2), Llama 3 did not recognize that the patient had already undergone surgery with curative intent and was presented only for adjuvant therapy. One could argue that the prompt should have been adapted, as MDT submissions usually state whether the presentation is pre- or post-therapeutic. However, the simulated case represents an imperfect submission, which unfortunately does occur in everyday clinical practice.

Moreover, we noticed that Llama 3 consistently ignored the German input, answering in English every time. The same applied to the word limit: not only did Llama 3 exceed it nearly every time, in one case it even claimed to respect the word limit although it was exceeded. These examples suggest that LLMs tend to ignore instructions they cannot fulfill rather than admitting deficiencies. For tumor treatment plans in particular, this is hazardous and should be considered when consulting LLMs in these matters. It underlines the inference of previous studies (in ORL and beyond) that LLMs are very promising for augmenting MDTs but cannot and should not replace them [14,15,16,17].
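Violations like an exceeded word limit can in principle be flagged automatically before a response reaches the board. A minimal sketch, assuming plain whitespace-based word counting; this check was not part of the study:

```python
def exceeds_word_limit(response: str, limit: int) -> bool:
    """True if a model response exceeds the prompted word limit.

    Words are counted by whitespace splitting; an illustrative
    heuristic, not the method used in the study.
    """
    return len(response.split()) > limit

# Example: a 120-word response against a 100-word limit is flagged.
sample = " ".join(["Wort"] * 120)
assert exceeds_word_limit(sample, 100)
assert not exceeds_word_limit("kurze Antwort", 100)
```

A comparable check for the response language would require a language-detection step and is less trivial, but the principle is the same: verify instruction compliance mechanically rather than trusting the model's own claim.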

Despite some fundamental flaws, the LLMs received respectable ratings for medical adequacy on a 6-point Likert scale from our four evaluating MDT members, reflected in a mean score of 4.7 (IQR 4–6) for ChatGPT-4o and 4.3 (IQR 3–5) for Llama 3 (Fig. 3). This might be explained by the fact that allegedly less important facts, such as ECOG status or smoking history, can easily fade into the background in an MDT, whereas they were frequently highlighted by the LLMs. In our study, the MDT had suggested a curative treatment regimen for a patient with ECOG stage 3 (L2); here, two raters stated that the information from the LLM might have changed the MDT's recommendation. Raters found it positive that the LLMs highlighted the patients' chronic diseases and addiction history. Overall, raters stated in 17% (33/150) of the ratings that the information provided by the LLM might have changed the board's recommendation. In individual cases, the use of LLMs can therefore have a very positive influence on the fate of a patient, particularly considering that the resources available to an MDT can vary enormously depending on its setting [2].

Obviously, this study has some limitations, including the use of constructed cases instead of real patient profiles. However, even after anonymization of the data sets, entry of real cases into a web-based LLM was not approved by our data protection officer. As the comparison of the web-based ChatGPT-4o with the locally run Llama 3 was a key objective of this study, we were limited to constructed patient profiles. Moreover, one could argue that an MDT comprising members from different institutions is not a realistic scenario; however, we chose this design deliberately in order to limit subjective and institutional biases within the MDT decision. In fact, we noticed local differences in default treatment regimens: for instance, whereas one institution performs dental rehabilitation prior to radiation therapy, others omit this step in order to accelerate the start of treatment. This exemplifies the virtue of a multicentric approach.

Taking these limitations into account, the present study is the first to evaluate a locally run LLM (Llama 3) on ORL, head and neck surgery MDT recommendations. Interestingly, the results of the local Llama 3 were overall inferior to those of the web-based ChatGPT-4o, yet with regard to the decision between a curative and a palliative procedure, the agreement between the MDT and the local Llama 3 was even higher. Since this is the first study in ORL, head and neck surgery evaluating a local LLM, the potential for optimization is certainly not exhausted yet. First, newer versions and the comparison of different local LLMs might improve results. Secondly, dedicated training on ORL, head and neck literature and refined prompting might further improve the outcome. It is therefore likely that local LLMs will catch up with the web-based versions in the near future. Especially in medicine, passing highly sensitive private patient data to privately owned, web-based LLMs is hardly imaginable in clinical practice [9]. Beyond data privacy, the open-source structure of local LLMs is quite cost-effective, requiring only a single standard computer available for less than €1000. Furthermore, local LLMs can be deployed in remote regions without internet connection; these models therefore also represent a great opportunity for low-resource settings in low- and middle-income countries (LMICs). Future research regarding medical applications should therefore focus on local LLMs.
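As one illustration of such a low-cost local setup, a served Llama 3 instance can be queried over a loopback HTTP endpoint so that patient data never leaves the machine. The sketch below assembles such a request without sending it; the endpoint follows the Ollama REST API (`http://localhost:11434/api/generate`) and the model name `llama3` is one common local serving choice — both are assumptions about the serving stack, not details from the study:

```python
import json

# Local endpoint of an Ollama server hosting Llama 3 (assumed setup;
# any comparable local inference server would work analogously).
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(case_summary: str, word_limit: int = 150) -> dict:
    """Assemble a request for a locally served model.

    Because the request targets localhost, the patient summary
    never leaves the machine - the data-protection argument for
    local LLMs made in the text.
    """
    prompt = (
        "Du bist Mitglied eines Kopf-Hals-Tumorboards. "
        f"Gib eine Therapieempfehlung in maximal {word_limit} "
        "Woertern auf Deutsch.\n\nFall: " + case_summary
    )
    return {"model": "llama3", "prompt": prompt, "stream": False}

payload = build_request("cT2 cN1 cM0 Oropharynxkarzinom, ECOG 1")
body = json.dumps(payload)  # would be POSTed to OLLAMA_URL
```

The case summary and model name here are hypothetical; in practice, the prompt structure and word limit would mirror the study's original MDT submission format.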

With regard to the local and open-source character of Llama 3, the results of this study might represent an important step toward the actual implementation of locally run, data-protection-compliant LLMs in real clinical practice. Despite our promising findings, it is unlikely that MDTs are close to being replaced by LLMs. Human assessment in tumor board decision-making currently proves more reliable than LLM-driven assessment, and patients will likely be reluctant to place decisions about their therapeutic fate in the hands of a computer [19]. Additionally, the medicolegal implications are entirely unclear at this point. Rather, MDTs are likely to be augmented by locally run LLMs, helping to reduce the geographical bias of tumor boards [2, 20]. Future studies should thus evaluate the integration of (local) LLMs into the MDT process.
