Chatbots provided incorrect, conflicting medical advice, researchers found: “Despite all the hype, AI just isn’t ready to take on the role of the physician.”

“In an extreme case, two users sent very similar messages describing symptoms of a subarachnoid hemorrhage but were given opposite advice,” the study’s authors wrote. “One user was told to lie down in a dark room, and the other user was given the correct recommendation to seek emergency care.”

  • dandelion (she/her)@lemmy.blahaj.zone · 4 hours ago

    link to the actual study: https://www.nature.com/articles/s41591-025-04074-y

    Tested alone, LLMs complete the scenarios accurately, correctly identifying conditions in 94.9% of cases and disposition in 56.3% on average. However, participants using the same LLMs identified relevant conditions in fewer than 34.5% of cases and disposition in fewer than 44.2%, both no better than the control group. We identify user interactions as a challenge to the deployment of LLMs for medical advice.

    The findings were more that users were unable to use the LLMs effectively (even though the LLMs were competent when provided with the full information):

    despite selecting three LLMs that were successful at identifying dispositions and conditions alone, we found that participants struggled to use them effectively.

    Participants using LLMs consistently performed worse than when the LLMs were directly provided with the scenario and task

    Overall, users often failed to provide the models with sufficient information to reach a correct recommendation. In 16 of 30 sampled interactions, initial messages contained only partial information (see Extended Data Table 1 for a transcript example). In 7 of these 16 interactions, users mentioned additional symptoms later, either in response to a question from the model or independently.

    Participants employed a broad range of strategies when interacting with LLMs. Several users primarily asked closed-ended questions (for example, ‘Could this be related to stress?’), which constrained the possible responses from LLMs. When asked to justify their choices, two users appeared to have made decisions by anthropomorphizing LLMs and considering them human-like (for example, ‘the AI seemed pretty confident’). On the other hand, one user appeared to have deliberately withheld information that they later used to test the correctness of the conditions suggested by the model.

    Part of what a doctor is able to do is recognize a patient’s blind-spots and critically analyze the situation. The LLM, on the other hand, responds based on the information it is given, and does not do well when users provide partial or insufficient information, or when users mislead it with incorrect information (for example, if a patient speculates about potential causes, a doctor would know to dismiss the incorrect guesses, whereas an LLM would constrain its responses based on those bad suggestions).
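    To make the gap concrete, here is a rough sketch of the two conditions the study compares: the LLM given the full written vignette directly, versus the LLM only seeing whatever partial description a user happens to type. This is not the authors’ actual harness; the model name, prompts, and vignette text are placeholders I made up for illustration.

    ```python
    # Minimal sketch of the study's two arms as I understand them -- NOT the
    # authors' code; model name, prompts, and the vignette are placeholders.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    SYSTEM = (
        "Suggest the single most likely condition and a disposition: "
        "self-care, see a GP, or seek emergency care."
    )

    # Toy stand-in for a scenario vignette (not taken from the paper).
    SCENARIO = (
        "Sudden 'worst headache of my life' during exertion, stiff neck, "
        "one episode of vomiting, light sensitivity."
    )

    def ask(user_text: str) -> str:
        """Send one user message and return the model's answer."""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder; not necessarily a model the study used
            messages=[
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": user_text},
            ],
        )
        return resp.choices[0].message.content

    # Arm 1: LLM alone, given the full written scenario (the setting where the
    # models identified conditions correctly in ~95% of cases).
    print(ask(SCENARIO))

    # Arm 2: user-mediated -- the model only ever sees what the participant types,
    # e.g. a partial, closed-ended framing like the ones described above.
    print(ask("I have a really bad headache, could this just be stress?"))
    ```

    The point of the design is that the model is identical in both calls; only the input pathway differs, and that is where the performance collapses.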

    • SocialMediaRefugee@lemmy.world · 5 hours ago

      Yes, LLMs are critically dependent on your input; if you give too little info, they will enthusiastically respond with what can be incorrect information.
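      To put that in concrete terms (a toy illustration in my own wording; the model name is a placeholder): the model’s entire picture of your case is the messages you send it, so any symptom you don’t type simply does not exist for it.

      ```python
      # Everything the model "knows" about your case is the text below -- nothing more.
      # (Toy example; model name is a placeholder.)
      from openai import OpenAI

      client = OpenAI()

      messages = [
          {"role": "system", "content": "You are a symptom-checking assistant."},
          # The stiff neck, vomiting, and sudden onset from the scenario above
          # are missing here because the user never typed them.
          {"role": "user", "content": "I have a bad headache, what should I do?"},
      ]

      reply = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
      print(reply.choices[0].message.content)  # likely confident self-care advice
      ```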

    • pearOSuser@lemmy.kde.social · 5 hours ago

      Thank you for showing the other side of the coin instead of just blatantly disregarding its usefulness. (Caution is always needed, though.)

      • dandelion (she/her)@lemmy.blahaj.zone · 4 hours ago

        don’t get me wrong, there are real and urgent moral reasons to reject the adoption of LLMs, but I think we should all agree that the responses here show a lack of critical thinking and mostly just engagement with a headline rather than actually reading the article (a kind of literacy issue) … I know this is a common problem on the internet, and I don’t really know how to change it - but maybe surfacing what people are skipping over will make it more likely they will actually read and engage with the content past the headline?