The promise and peril of AI chatbots in healthcare

Banner: Vecteezy

Artificial intelligence chatbots like ERNIE Bot, ChatGPT and DeepSeek can often outperform doctors in diagnosis, but we need safeguards to avoid overprescribing and reinforcing inequality

By Dr Yafei Si and Professor Gang Chen, University of Melbourne


Published 3 October 2025

If you've been to a medical appointment recently, you may have already interacted with AI. As you describe your symptoms to the doctor, they may ask your permission to use an 'AI scribe' to convert audio into medical notes in real time.

Or maybe you’ve typed your symptoms into ChatGPT to get a possible diagnosis – sometimes reassuring, sometimes alarming.

AI scribes convert audio into medical notes in real time. Picture: Getty Images

Artificial intelligence (AI) for healthcare is increasingly trialled in hospitals, clinics and even on our phones.

Chatbots powered by large language models are being promoted as a way to fill gaps in healthcare, especially where doctors are scarce.

But our new research has found that while AI chatbots like ERNIE Bot, ChatGPT and DeepSeek show promise, they also pose significant risks – ranging from overtreatment to reinforcing inequality.

Global tools, local risks

AI already plays a role in many areas of healthcare – from reading X-rays to powering triage chatbots.

Over 10 per cent of Australian adults reported using ChatGPT for health-related questions in the first half of 2024 – with many looking for clinical advice rather than basic information – highlighting AI’s growing influence in health decision-making.

But most research has focused on how accurate these chatbots are in theory, not how they behave with patients in practice.

Our study is among the first to rigorously test chatbot performance in simulated real-world consultations, making the findings particularly relevant as governments and hospitals race to adopt AI solutions.

We tested ERNIE Bot, a widely used Chinese chatbot, alongside OpenAI’s ChatGPT and DeepSeek, two of the most advanced global models.

We compared their performance with human primary care doctors using simulated patient cases.

We also tested for disparities by systematically varying patient characteristics – including age, gender, income, residence and insurance status – in standardised patient profiles, and then analysing whether the chatbots' quality of care changed across these groups.

AI already plays a role in many areas of healthcare, including reading X-rays. Picture: Shutterstock

We presented common, everyday symptoms like chest pain or breathing difficulties. For example, a middle-aged patient reports chest tightness and shortness of breath after light activity.

The bot or doctor is expected to ask about risk factors, order an ECG, and consider angina as a possible diagnosis.

A younger patient complains of wheezing and difficulty breathing that worsens with exercise. The expected response is to confirm asthma and prescribe appropriate inhalers.

We presented the same symptoms with different patient profiles – for example, an older versus a younger patient, or a patient with higher versus lower income – to see whether the chatbots' recommendations changed.
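To make the study design concrete, below is a minimal sketch of how a standardised-patient test harness like this could look in code. Everything in it is illustrative: the vignette wording, the profile attributes, the ask_chatbot() stub and the guideline checklist are hypothetical placeholders, not the study's actual protocol, prompts or code.

```python
# Illustrative sketch of a standardised-patient style test harness.
# All names, prompts and values below are hypothetical placeholders.

from itertools import product

# One clinical vignette, held constant across all patient profiles.
VIGNETTE = (
    "Patient reports chest tightness and shortness of breath after light activity. "
    "What questions would you ask, what tests would you order, and what is the "
    "most likely diagnosis?"
)

# Demographic attributes to vary systematically (illustrative values only).
AGES = ["35-year-old", "70-year-old"]
INCOMES = ["low-income", "high-income"]

# A toy guideline checklist for this vignette (assumption: an ECG is indicated,
# an expensive scan is not).
EXPECTED_TESTS = {"ecg"}
UNNECESSARY_TESTS = {"ct scan", "mri"}


def ask_chatbot(prompt: str) -> str:
    """Placeholder for a call to a chatbot API.

    Replace this stub with a real model call; it returns a canned reply here
    so the sketch runs end to end without network access.
    """
    return "I would order an ECG and a CT scan, and consider angina."


def score_reply(reply: str) -> dict:
    """Flag whether the reply includes the expected test and any unnecessary ones."""
    text = reply.lower()
    return {
        "expected_test_ordered": any(t in text for t in EXPECTED_TESTS),
        "unnecessary_test_ordered": any(t in text for t in UNNECESSARY_TESTS),
    }


if __name__ == "__main__":
    # Present the same symptoms under every combination of profile attributes,
    # then compare scores across groups to look for disparities.
    for age, income in product(AGES, INCOMES):
        prompt = f"A {age}, {income} patient. {VIGNETTE}"
        result = score_reply(ask_chatbot(prompt))
        print(f"{age:>12} | {income:>11} | {result}")
```

The simple keyword check here only marks where a comparison against clinical guidelines would sit; a real evaluation would need far more careful, guideline-based scoring of each response.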

Accuracy meets overuse and inequality

All three AI chatbots – ERNIE Bot, ChatGPT and DeepSeek – were highly accurate in reaching a correct diagnosis, outperforming the human doctors.

But AI chatbots were far more likely than doctors to suggest unnecessary tests and medications.

In fact, they recommended unnecessary tests in more than 90 per cent of cases and prescribed inappropriate medications in more than half.

For example, when presented with a patient wheezing from asthma, the chatbots sometimes recommended antibiotics or ordered expensive CT scans – neither of which is supported by clinical guidelines.

And AI performance varied by patient background.

For example, older and wealthier patients were more likely to receive extra tests and prescriptions.

Our findings show that while AI chatbots could help expand healthcare access – especially in countries where many people lack reliable primary care – without oversight they could also drive up costs, expose patients to harm and make inequality worse.

There is an urgent need to co-design safe and responsible AI chatbots for use in daily life. Picture: Shutterstock

Healthcare systems need to design safeguards – like equity checks, clear audit trails and mandatory human oversight for high-stakes decisions – before these tools are widely adopted.

Our research is timely, given the global excitement – and concern – around AI.

While chatbots could help fill critical gaps in healthcare, especially in low- and middle-income countries, we need to carefully balance innovation with safety and fairness.

Co-designing AI for safety and justice

There is an urgent need to co-design safe and responsible AI chatbots for use in daily life, particularly in delivering reliable health information.

AI is coming to healthcare whether we are ready or not.

By identifying both its strengths and risks, our study provides evidence to guide how we use these powerful new tools safely, fairly and responsibly.

We are hoping to continue this critical area of research in Australia to ensure AI technologies are developed with equity and trust at their core and are beneficial for our community.

The research team welcomes collaborations to advance this work.
