General-Purpose LLMs Outperform Specialised Clinical AI Tools, Nature Medicine Study Finds

A study published in Nature Medicine has found that general-purpose large language models (LLMs) significantly outperform specialised clinical AI tools in medical diagnostics. In a blind evaluation involving 12 physicians and 1,800 real clinical queries, Gemini 3.1 Pro and GPT-5.2 achieved 97.4% and 94.2% accuracy respectively, while the dedicated systems OpenEvidence and UpToDate Expert AI scored only around 80%. The authors attribute the gap to the specialised tools’ heavy reliance on retrieval-augmented generation (RAG) pipelines, in which irrelevant retrieved literature and integration errors degrade output quality.

The study also revealed operational shortcomings in the specialised systems: OpenEvidence produced chaotic responses, and UpToDate Expert AI rejected 19% of queries because of overly strict safety filters. Hallucination rates for the general-purpose models were no higher than those of their specialised counterparts. The researchers recommend that clinics conduct independent testing of AI products before deployment.

Nature Medicine paper