Apical lesions: when AI answers patients' questions

AI and apical lesions: How reliable is it when addressing patient inquiries?

As patients increasingly consult artificial intelligence (AI) to interpret their radiographs, practitioners face a new form of digital self-diagnosis. The issue is particularly critical in endodontics: differentiating an odontogenic pathology from a non-odontogenic lesion requires clinical expertise, which large language models (LLMs) could undermine through errors or oversimplifications.

This study aimed to evaluate the reliability and clinical appropriateness of the responses provided by three publicly accessible AI platforms, ChatGPT (GPT-4o), Grok (Grok-1) and DeepSeek (DeepSeek-V2), to 15 standardised questions simulating lay queries on apical lesions. The objective was to determine whether these tools can serve as reliable support for therapeutic education or whether they carry a risk of misinformation.

The authors tested the ability of these models to provide acceptable and clear information by having an expert panel (an endodontist and an oral pathologist) compare their performance. The underlying question was whether these LLMs, despite their generalist design, could offer sufficient accuracy to guide a patient correctly without substituting for professional clinical judgement.

Methodology: Comparative analysis of three LLMs by a panel of experts

This cross-sectional comparative study evaluated the reliability and clinical relevance of three language models (LLMs) in response to typical patient questions about apical lesions. The protocol is based on 15 standardised questions, formulated in non-technical language, covering four clinical domains: lesion identification, differentiation between odontogenic and non-odontogenic causes, therapeutic management, and risks associated with follow-up.

The experimental protocol used the following versions of the platforms:

ChatGPT (version GPT-4o);
Grok (version Grok-1);
DeepSeek (version DeepSeek-V2).

Each question was submitted only once (a "one-shot" approach) via the standard public interface, without follow-up prompts or contextual refinement, thereby generating a total of 45 unique responses. Two independent experts — a certified endodontist and an oral pathologist — analysed the results. In total, 90 evaluations were conducted on a 5-point Likert scale (from 1 "strongly disagree" to 5 "strongly agree"), assessing clarity and clinical adequacy for therapeutic education.

Inter-rater reliability was measured with Cohen's kappa coefficient (0.85). Differences in score distribution between the platforms were assessed with the non-parametric Kruskal-Wallis H test, with the significance level set at p < 0.05.

Rater reliability and agreement

The inter-rater reliability analysis, involving a certified endodontist and an oral pathologist, yielded a Cohen's kappa coefficient of 0.85. This result indicates almost perfect agreement between the experts regarding the reliability and clinical appropriateness of the AI-generated responses for patient education.
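
To make the statistic concrete, here is a minimal sketch of how Cohen's kappa is computed from two raters' scores. It assumes the unweighted form of kappa (the summary does not say whether a weighted variant was used), and the ratings below are invented for illustration only; they are not the study's data, which are not reproduced here.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of ratings."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must score the same non-empty item set")
    # Observed agreement: share of items given the same score by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per score.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(count_a[k] * count_b[k] for k in count_a) / n**2
    # Undefined (division by zero) if both raters always give one identical score.
    return (p_observed - p_expected) / (1 - p_expected)


# HYPOTHETICAL Likert ratings for ten responses (not taken from the study).
endodontist = [5, 5, 5, 4, 4, 5, 1, 5, 4, 5]
pathologist = [5, 5, 5, 4, 5, 5, 1, 5, 4, 5]

kappa = cohen_kappa(endodontist, pathologist)
print(f"kappa = {kappa:.2f}")  # kappa = 0.80
```

In this toy example the raters agree on 9 of 10 items (0.90 observed agreement), chance agreement is 0.49, and kappa is therefore 0.41 / 0.51, or about 0.80.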

Quantitative analysis of performance by platform

Out of 30 individual evaluations per platform (15 standard questions, each evaluated by two experts), most awarded the maximum score of 5 on the Likert scale. ChatGPT distinguished itself with a complete absence of scores below 4.

Platform	Score 5 (Strongly agree)	Score 4 (Agree)	Score 1 (Strongly disagree)
ChatGPT (GPT-4o)	27	3	0
Grok (Grok-1)	24	5	1
DeepSeek (DeepSeek-V2)	23	6	1
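
The counts in the table can be turned into proportions with a few lines of Python. This is only an illustrative sketch: the dictionary layout and variable names are assumptions, while the numbers are transcribed from the table above.

```python
# Score counts per platform, transcribed from the table (30 evaluations each).
counts = {
    "ChatGPT": {5: 27, 4: 3, 1: 0},
    "Grok": {5: 24, 4: 5, 1: 1},
    "DeepSeek": {5: 23, 4: 6, 1: 1},
}

# Share of evaluations that awarded the maximum score of 5.
top_share = {name: c[5] / sum(c.values()) for name, c in counts.items()}

for name, share in top_share.items():
    print(f"{name}: {share:.1%} of evaluations scored 5")
# ChatGPT: 90.0% of evaluations scored 5
# Grok: 80.0% of evaluations scored 5
# DeepSeek: 76.7% of evaluations scored 5
```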

Statistical significance and comparisons

The non-parametric Kruskal-Wallis H test revealed no statistically significant difference in the distribution of agreement scores between the three platforms (H = 2.05; df = 2; p = 0.36). Although ChatGPT numerically achieved the highest proportion of "strongly agree" ratings, the three chatbots demonstrated comparable performance levels in addressing patient-oriented endodontic questions.
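
The reported statistic can be checked against the results table. The sketch below rebuilds the 90 scores from the table counts and applies the standard tie-corrected Kruskal-Wallis formula using only the Python standard library. The function and variable names are illustrative, and this is not necessarily the software or exact procedure the authors used.

```python
import math
from collections import Counter

# Score counts per platform, transcribed from the results table
# (no scores of 2 or 3 are reported).
counts = {
    "ChatGPT": {5: 27, 4: 3, 1: 0},
    "Grok": {5: 24, 4: 5, 1: 1},
    "DeepSeek": {5: 23, 4: 6, 1: 1},
}


def kruskal_wallis(groups):
    """Return (tie-corrected H, degrees of freedom) for name -> list of scores."""
    pooled = [x for g in groups.values() for x in g]
    n_total = len(pooled)
    # Tied values share the average (mid) rank of the positions they occupy.
    freq = Counter(pooled)
    midrank, start = {}, 1
    for value in sorted(freq):
        ties = freq[value]
        midrank[value] = start + (ties - 1) / 2
        start += ties
    # Uncorrected H computed from each group's rank sum.
    rank_term = sum(
        sum(midrank[x] for x in g) ** 2 / len(g) for g in groups.values()
    )
    h = 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)
    # Correction factor for ties (heavy here, since only three values occur).
    correction = 1 - sum(t**3 - t for t in freq.values()) / (n_total**3 - n_total)
    return h / correction, len(groups) - 1


# Expand the counts into one list of scores per platform.
scores = {
    name: [score for score, k in c.items() for _ in range(k)]
    for name, c in counts.items()
}

h_stat, df = kruskal_wallis(scores)
# For df = 2 the chi-square survival function reduces to exp(-H / 2).
p_value = math.exp(-h_stat / 2)
print(f"H = {h_stat:.2f}, df = {df}, p = {p_value:.2f}")  # H = 2.05, df = 2, p = 0.36
```

Run on the table counts, the sketch reproduces the values quoted in the text. Because 74 of the 90 scores are tied at 5, the tie correction matters: without it, H would be about 0.90.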

Qualitative observations and response profiles

The expert evaluation revealed distinct communication profiles depending on the models:

ChatGPT: This model tends to provide the most detailed and in-depth clinical explanations.
Grok and DeepSeek: These platforms were perceived as using simpler and more accessible language, favouring communication centred on the lay patient.

It should be noted that Grok and DeepSeek each received a single score of 1: an isolated case in which an expert strongly disagreed that the information provided was clinically relevant.

Differential diagnosis challenged by algorithms

The results of this study mark a milestone in the integration of artificial intelligence (AI) into practitioner-patient communication. The two experts (endodontist and pathologist) agreed closely in their ratings (kappa = 0.85), and they rated the large majority of the chatbot explanations of apical lesions at the maximum score. The absence of a statistically significant difference between the three platforms (p = 0.36) suggests a comparable level of technological maturity for scientific popularisation.

The qualitative analysis nevertheless reveals nuances in usage: ChatGPT stands out for its clinical precision (27 of 30 evaluations awarding the maximum score), whereas Grok and DeepSeek favour more accessible wording. This distinction matters in practice: AI no longer merely translates complex terms, it adapts its level of discourse to the interlocutor. The scores of 1 occasionally awarded to Grok and DeepSeek are a reminder that these models are not infallible and can still produce approximations on critical points such as long-term follow-up.

The limitations of this study lie in its exclusively textual format and its sample of 15 questions. In endodontic practice, diagnosis relies on imaging analysis, a dimension absent from this protocol. These tools must therefore be considered as educational aids to alleviate patient anxiety regarding a "dark spot" on the radiograph, without ever replacing the paramount clinical examination.

Study summary

This comparative evaluation indicates that ChatGPT, Grok and DeepSeek produce clinically appropriate responses to patient enquiries regarding apical lesions, with excellent inter-rater reliability (kappa = 0.85). Although ChatGPT received the highest proportion of maximum scores from the experts, no statistically significant difference was observed between the three platforms (p = 0.36) regarding the overall quality of the information provided.

In practical terms, for the practitioner:

Optimise your patient communication: Use these tools as educational adjuncts; ChatGPT is recommended for detailed clinical explanations, whereas Grok and DeepSeek perform better for simple and accessible lay explanations.
Maintain diagnostic vigilance: Despite high scores, isolated responses were rated 1 ("strongly disagree") by an expert. AI must complement, not replace, your clinical judgement, particularly to differentiate odontogenic pathologies from sinus or tumoural causes.
Anticipate online searches: As patients increasingly submit their radiographs to AI, integrate these topics (risks, follow-up, necessity of treatment) into your information protocols to reinforce your role as a scientific authority.

Technical glossary of the study

LLM (Large Language Model): A language model, such as GPT-4o, Grok-1 or DeepSeek-V2, trained on large volumes of text to process and generate natural language, used here for therapeutic education in endodontics.

Endodontic apical lesions: Radiographic alterations located at the root apex, requiring nuanced clinical interpretation to distinguish inflammatory pathologies from anatomical or tumoural variants.

5-point Likert scale: Ordinal assessment method used by experts (from 1 'strongly disagree' to 5 'strongly agree') to quantify the reliability, clarity and clinical relevance of the responses provided by the AI.

Cohen's kappa coefficient: Statistical measure of inter-rater reliability. The value of 0.85 reported in the study indicates an almost perfect agreement between the endodontist and the oral pathologist in their assessment of the responses.

Kruskal-Wallis H test: Non-parametric statistical test applied to compare the distributions of reliability scores between the three AI platforms; it found no significant difference (p = 0.36).

Odontogenic vs non-odontogenic pathoses: Diagnostic differentiation between lesions of dental origin (e.g. apical periodontitis) and those originating from adjacent structures (e.g. maxillary sinus, non-dental cysts), a complex area on which the chatbots were tested.

Source

Original title: Reliability of Artificial Intelligence Chatbots in Answering Patient-Oriented Questions About Endodontic Apical Lesions
Authors: Faraj Alotaiby, Waleed Almutairi
Publication: Cureus - 2026-04-27
DOI: https://doi.org/10.7759/cureus.107783

Information intended for healthcare professionals. This content may contain errors or truncated summaries. We recommend always checking against the original source article. Delynov disclaims all liability regarding the use of this information. This document is not intended for patients or the general public.

Oral lichen planus: does laser outperform corticosteroids?

Oral lichen planus (OLP) is a chronic autoimmune disease mediated by T lymphocytes, ...