As AI chatbots are increasingly used for sensitive health-related queries, it has become imperative to understand the quality of the information they provide and, importantly, whether it can be comprehended by the general public. The results of this study and prior research indicate that chatbots perform inconsistently in delivering high-quality, easily understandable information.
Reliability of AI-Generated Health Information

This study utilized DISCERN, EQIP, GQS, and JAMA indices to evaluate the reliability and quality of chatbot-generated health information. Each index was selected for its specific advantages and ability to assess different aspects of information quality and reliability. DISCERN and EQIP are well-established tools designed to evaluate the reliability and quality of written health materials. A prior study comparing these two scales found that both exhibit high reliability; however, DISCERN demonstrated superior inter-rater agreement, making it particularly effective for consistent evaluations across different reviewers [29]. DISCERN focuses on criteria such as the clarity, balance, and comprehensiveness of health-related information, while EQIP emphasizes more detailed content-specific evaluation, particularly in clinical settings. In contrast, GQS is a simpler and more subjective scoring system aimed at assessing the overall quality of information. While it lacks the depth and structure of DISCERN and EQIP, it provides a valuable perspective by considering user perception and the general utility of the material. JAMA, on the other hand, evaluates credibility based on authorship, attribution, disclosure, and currency.
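For readers unfamiliar with these instruments, the sketch below illustrates how the simplest of them, the JAMA benchmark, is tallied: one point for each of authorship, attribution, disclosure, and currency. It is an illustrative Python sketch only; the assessments in this study were performed by human reviewers, and the field and class names are hypothetical.

```python
# Minimal, illustrative tally of the JAMA benchmark (not the actual rating
# procedure used in this study, which was performed manually by reviewers).
# Each of the four criteria contributes one point when satisfied.

from dataclasses import dataclass

@dataclass
class JamaAssessment:
    authorship: bool   # authors/contributors and their credentials are stated
    attribution: bool  # references and sources for the content are provided
    disclosure: bool   # ownership, sponsorship, or conflicts of interest are disclosed
    currency: bool     # dates of creation or last update are indicated

    def score(self) -> int:
        """Return the JAMA benchmark score (0-4)."""
        return sum([self.authorship, self.attribution, self.disclosure, self.currency])

# Example: a response that cites sources and is dated, but names no author and
# carries no disclosure statement, scores 2 out of 4.
example = JamaAssessment(authorship=False, attribution=True, disclosure=False, currency=True)
print(example.score())  # 2
```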
Despite their utility, these indices were originally developed for evaluating static, written materials and may not be perfectly suited for assessing chatbot-generated responses. Chatbots, while producing written responses, differ from traditional written materials in their conversational nature and dynamic content generation. Nevertheless, studies have shown correlations between these indices, suggesting that their combined use offers a comprehensive evaluation of both reliability and quality from multiple perspectives [30]. By employing these diverse scoring systems, we aimed to provide a more nuanced understanding of chatbot performance.
Perplexity and Copilot outperformed ChatGPT and Gemini across several reliability metrics, particularly DISCERN and EQIP. This finding underlines the importance of source attribution in perceived reliability. Perplexity and Copilot achieved higher reliability scores by consistently citing sources or providing clear references. In contrast, ChatGPT, which lacks verifiable sources, scored significantly lower on both the DISCERN and JAMA indices [31]. The lower scores for ChatGPT and Gemini may reflect their more generalized approach, in which sourcing and authorship are not prioritized. This can make it difficult to assess the credibility of the information, reducing trust in medical recommendations. Source transparency is a key determinant of perceived credibility in medical information. Users increasingly rely on AI chatbots for health-related queries, and when chatbots fail to cite reliable sources, they risk disseminating incomplete or potentially misleading information. This is particularly concerning for sensitive health topics such as STDs, where misinformation can lead to harmful consequences, including delays in seeking medical care or the spread of inaccurate prevention strategies.
Readability Challenges

Readability was another major focus of this study. Multiple readability indices were employed to comprehensively assess the complexity of chatbot-generated health information. While indices such as Flesch Reading Ease and Flesch-Kincaid are widely used in the literature, prior research suggests that the SMOG formula may be better suited for evaluating healthcare-related materials [32]. This is because SMOG has been validated against 100% comprehension and is based on more recent criteria for determining reading grade levels [32]. By including all of these indices, our analysis aimed to capture a holistic view of readability across different metrics. The results were consistent across all indices, indicating that the readability levels of all chatbot responses exceeded the recommended 6th-grade threshold, reinforcing the challenge these systems face in delivering easily understandable information to users with lower health literacy. This approach also allowed us to identify potential areas for improvement in simplifying the language and structure of AI-generated health content. For example, a response from ChatGPT regarding chlamydia treatment included technical terminology such as “macrolide antibiotic” and “bacterial protein synthesis inhibition at the ribosomal level,” which may be difficult for lay users to interpret. In contrast, a more patient-friendly phrasing would simplify the explanation by stating that chlamydia can be treated with antibiotics that stop bacterial growth.
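To make the underlying arithmetic concrete, the following minimal Python sketch computes the standard Flesch Reading Ease, Flesch-Kincaid Grade Level, and SMOG scores for a short passage, assuming the open-source textstat package; it illustrates the published formulas and is not the exact analysis pipeline used in this study.

```python
# Minimal readability sketch (illustrative only; assumes the open-source
# `textstat` package). The standard formulas it implements are:
#   Flesch Reading Ease   = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
#   Flesch-Kincaid Grade  = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
#   SMOG Grade            = 1.0430*sqrt(polysyllables * 30/sentences) + 3.1291

import textstat

# Hypothetical chatbot-style response using the kind of terminology noted above.
response = (
    "Chlamydia is usually treated with a macrolide antibiotic, which acts by "
    "inhibiting bacterial protein synthesis at the ribosomal level."
)

print("Flesch Reading Ease:", textstat.flesch_reading_ease(response))
print("Flesch-Kincaid Grade:", textstat.flesch_kincaid_grade(response))
print("SMOG Grade:", textstat.smog_index(response))

# A grade-level result above 6 would exceed the recommended sixth-grade
# threshold for patient-facing health materials.
```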
This suggests that although AI-powered chatbots can provide a great deal of information, their responses may be difficult to interpret for users with low levels of health literacy. Given the importance of health literacy, it is concerning that the information provided by these chatbots may not be equally accessible to all users. For sensitive topics where clear and easy-to-understand information is crucial, this represents a potential limitation of current AI chatbot technology.
Review of Previous Literature

Previous studies examining AI chatbots in medical contexts have similarly identified issues regarding source transparency and the readability of the information provided. For example, in a study comparing the performance of five different chatbots on penile prosthesis-related questions, ChatGPT showed lower reliability because it provided no source citations, resulting in lower DISCERN scores [33]. Similarly, in our study, ChatGPT and Gemini were classified as poor quality based on their DISCERN scores, while Perplexity and Copilot performed significantly better. This suggests that transparent source citation is critical to establishing trust in medical information generated through chat interfaces.
Likewise, a study evaluating the quality of chatbot-generated responses on urological malignancies found that while chatbots such as Perplexity and Copilot offer reliable, accurate information, their readability remains an issue, with responses often written at a high reading level, making them inaccessible to patients with lower health literacy [7].
Another study evaluating kidney stone patient information materials also found that although AI chatbots improve access to medical information, they often fail to meet readability guidelines, with their content mostly being too complex for the average patient [14].
A study evaluating AI chatbot responses to erectile dysfunction queries found significant differences in quality and readability between chatbots [13]. ChatGPT, Copilot, Bard, and others demonstrated variable performance across metrics such as DISCERN, EQIP, and readability scores [13]. Another study examining premature ejaculation information from ChatGPT further supports this, reporting similarly low scores on EQIP and DISCERN and noting that the language used was complex [9].
The poor readability of chatbot-generated medical information is consistently highlighted in these studies. Responses on urological topics such as erectile dysfunction and premature ejaculation did not meet the recommended readability grade level in any of the studied chatbots [9, 12, 13]. This is consistent with the findings of the present study.
Limitations

There are several limitations to this study that should be acknowledged. First, the analysis was based on a small sample of AI chatbot responses, specifically limited to sexually transmitted diseases. While this focus allowed for a detailed evaluation in one subject, the results may not be fully generalizable to other medical topics.
Another limitation is the dataset size and the potential for inherent biases in chatbot responses. The study relied on a limited number of queries, which may not fully capture the breadth of user inquiries regarding STDs. AI chatbots generate responses based on their training data, which inherently includes biases from the sources they were trained on. These biases can influence the accuracy, completeness, and framing of the information provided. Additionally, variations in chatbot responses due to different phrasing of queries or context-specific nuances could affect reliability assessments.
The use of Google Trends in this study comes with certain limitations. One key limitation is the lack of detailed demographic information, such as age and sex, as well as the inability to determine the exact sample size of the data. While previous research has demonstrated that web-based data can provide valuable and valid insights into behavior and often correlate with actual data, the reliability of online queries may be compromised in regions with low internet penetration [16,17,18, 34]. However, since our study focuses specifically on internet users, we consider these findings suitable for our objectives. Furthermore, as we conducted a worldwide analysis, variation in the questions asked about the topic across different regions, as well as linguistic diversity and cultural nuances, could influence the results [35, 36]. Despite these constraints, Google Trends remains a useful tool for examining global patterns in online health information searches.
The study relied on established readability and reliability metrics, which may not fully capture the nuances of AI-generated content. While DISCERN and Flesch Reading Ease provide important insight into the quality and accessibility of information, neither takes into account subjective factors such as user trust or the conversational tone of chatbot responses. Future research should explore modern evaluation frameworks, such as machine learning-based quality assessment models or real-time user engagement metrics, to better assess the reliability, readability, and practical utility of AI-generated health information.
Lastly, AI chatbot models are constantly evolving and are updated regularly. The versions evaluated in this study may differ from future versions, which could affect the reproducibility of these results over time. Regular evaluations will be needed to track how the performance of these AI systems changes as they are updated.
Practical Implications

The findings of this study have important implications for the development and implementation of AI chatbots in healthcare. Among the evaluated chatbots, Perplexity and Copilot emerged as the most reliable options, demonstrating higher DISCERN and EQIP scores compared to ChatGPT and Gemini. Their ability to provide clearer, more structured responses with some degree of source attribution makes them preferable for users seeking accurate medical information. However, even these chatbots did not achieve an “excellent” reliability rating, indicating the need for further improvements in the transparency and quality of AI-generated content.
The inconsistent reliability and poor readability of chatbot responses highlight the need for improvements in both areas. For healthcare providers and developers, this means prioritizing the inclusion of transparent citation mechanisms and simplifying the language used in responses. Ensuring that chatbot-generated content is accessible to users with lower health literacy is essential for equitable access to health information.
From a public health perspective, these findings underscore the importance of responsible AI deployment in medical communication. Healthcare professionals should be aware of chatbot limitations and guide patients toward more reliable information sources when necessary. AI chatbots, while promising, should be regarded as supplementary tools rather than replacements for professional medical consultation.
Future Directions

Future research should focus on developing new evaluation frameworks tailored to the unique characteristics of chatbot-generated content. These frameworks should account for the conversational nature and dynamic generation of responses, which differ from traditional written materials. Additionally, further studies are needed to explore how chatbot updates impact performance over time and to identify strategies for improving both reliability and readability. Expanding the scope of analysis to include diverse medical topics will also help generalize findings and enhance the practical utility of chatbots in healthcare.
Moreover, the inconsistent performance across chatbots suggests that current AI models vary significantly in how they retrieve and present medical knowledge. This highlights the need for standardized guidelines in AI-driven health communication, ensuring that chatbots provide evidence-based, clearly referenced, and easily verifiable content. Developers should prioritize integrating structured citation mechanisms and improving fact-verification algorithms to enhance the reliability of chatbot-generated medical advice. To enhance the reliability and accessibility of AI-generated health information, future research should explore ways to integrate domain-specific data into chatbot training models. Additionally, refining readability algorithms by simplifying sentence structures and reducing technical jargon could significantly improve user comprehension.
To ensure continuous assessment of AI chatbot reliability and readability, future studies should consider a longitudinal evaluation strategy. Given the frequent updates and refinements of chatbot models, periodic reassessments are necessary to track changes in their performance over time. A structured framework for ongoing evaluations would provide valuable insights into how modifications affect chatbot-generated health information. Such an approach could also aid in identifying trends and improvements in chatbot reliability and readability.