AI in Medicine: Hubris, Hype, and Half-Science

Due to time constraints, I haven’t been able to maintain my blog since June 2023, which is why there haven’t been any new posts since then. However, the following article (published in Laborjournal 4/2023) fits very well with the last post, “Artificial Intelligence: Critique of Chatty Reasoning”, and since I’ve received many requests for an English version, I asked ChatGPT (which, by the way, didn’t take offense at the content!) to provide a translation. The original German version can be found in Laborjournal.

AI has the potential to revolutionize medicine — but a reckless ‘move fast, break things’ mindset, industry lobbying, and glaring scientific gaps in transparency, validation, and bias control are getting in the way of building truly evidence-based AI.

“Chatbot beats doctors at diagnosis!” That’s the kind of headline we’ve seen recently—from X to the New York Times. A randomized controlled trial led by a shiny lineup of authors (Stanford, Harvard, Beth Israel Deaconess, you name it) supposedly proved that ChatGPT generates more accurate diagnoses than medical professionals at some of the world’s top university hospitals. The AI scored 92% accuracy, compared to 74% for the doctors. And even when doctors were allowed to use the chatbot, they barely improved (up to just 76%). So the takeaway? AI isn’t just better at diagnosing—it also fails to teach doctors anything useful!

This is a perfect example—though not of AI’s supposed brilliance in medicine. Rather, it’s a showcase of the over-the-top AI hype and the poor quality of studies in this field. Which is why I had to weigh in once more—having already shared some fundamental thoughts on AI’s so-called “intelligence” a while back.

Let’s start with the hype—because that’s the easy part. By now, even the most wide-eyed optimists have probably realized just how overblown the promises around AI and healthcare have become. Two quick examples, just to illustrate the point:

Deloitte, the consulting giant, predicts that AI will save 400,000 lives, €200 billion, and 1.8 billion working hours every year in the EU alone. Yes, every year. Meanwhile, Sam Altman, the CEO of OpenAI, recently got up on stage alongside Donald Trump to launch the “Stargate Project” (a casual $500 billion for AI in the U.S.) and declared: “AI will help cure diseases at unprecedented speed. We’ll be amazed how fast we cure this cancer or that one, and heart disease too. This ability to cure disease rapidly, I think, will be one of the most important achievements of this technology.”

Those of you who have been around for a while know that this isn’t the first time the AI bandwagon has rolled through town. It’s the fourth. Back in 1967, AI pioneer Marvin Minsky confidently predicted that “the AI problem will be solved within a generation.” In the 1980s, the buzz was all about expert systems taking over the roles of doctors and lawyers. Then came Deep Blue beating Garry Kasparov at chess, and suddenly people felt like AGI via neural networks was just around the corner.

To be fair, ChatGPT isn’t wrong when it speaks confidently about AI history: the hype comes and goes—but the progress keeps marching on.

The science behind AI in medicine will get its turn in a moment—but first, let’s ask: how much AI is actually out there? How much of it is already in real, day-to-day use in medicine? Because the truth is, we’ve been trying to bring AI into healthcare for well over a decade now—with a lot of effort and an enormous amount of money.

Take IBM’s Watson for Oncology. Launched in 2010. Officially buried—along with IBM’s entire AI Health division—in 2022.

Right now, there are about 1,000 AI and machine learning algorithms approved by the FDA. Most of them are used in medical imaging—radiology and cardiology, to be specific. Sounds like a lot? It’s really not—especially considering how much time and money has gone into getting us here. Plus, most of these approvals are clustered in just a few specialties, and many of them overlap heavily in terms of what they actually do.

But here’s the more interesting question: how many of these AI tools have made it into medical society guidelines with a high level of evidence (Level I)? You can count them on two hands. And even then, they’re not being recommended because they improve patient outcomes. They’re in there because they help doctors work more efficiently. In other words, they save time.

Which isn’t a bad thing—but it’s also not exactly groundbreaking.

The fact that so little AI is actually recommended in clinical guidelines isn’t exactly surprising. Hard empirical evidence showing that AI improves diagnostics or treatment outcomes—or even that it’s cost-effective in healthcare—is rare. And when such evidence does exist, it’s often methodologically shaky.

Most studies focus on technical performance metrics or clinical feasibility—they’re essentially just proof-of-concept projects. Robust health economics evaluations? Almost nonexistent. One systematic review found just 86 randomized controlled trials in the entire field, and over two-thirds of them were either too small or had questionable methodology.

Without high-quality studies, we simply don’t know whether specific AI tools actually save money, improve clinical decisions, or lead to better outcomes for patients. For all we know, they might even make things worse.

Which brings us to the science behind AI. And, surprise—just like many other areas of research, AI is smack in the middle of a reproducibility crisis. Only about 5% of AI researchers share their source code, and fewer than a third provide access to their training or validation datasets. That means reproducing their results isn’t just difficult—it’s often impossible. And in the rare cases where replication was possible and actually attempted, fewer than a third of key findings could be reproduced.

But how can something that can’t be repeated serve as a solid foundation for building reliable tools in biomedicine?

Reproducibility is a challenge across all fields of science, no doubt. But with AI-generated results, there’s an extra layer of issues that are unique to the technology itself. Randomness and stochastic behavior in deep learning can lead algorithms to produce different outputs—even with the same inputs. A lack of standardization during preprocessing—like how data is labeled for classification—can dramatically affect model performance. And differences in hardware or software environments, like using chips from different manufacturers, can also produce inconsistent results.
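
To make the randomness point concrete, here is a minimal, purely illustrative sketch (assuming PyTorch; the toy data, model, and seed value are invented). The same training loop run twice without a fixed seed lands at different losses, while pinning the seeds, and asking the framework to error out on non-deterministic operations, makes the runs identical.

import random
import numpy as np
import torch
import torch.nn as nn

def train_once(seed=None):
    # Hypothetical toy run: synthetic "patient features" and a tiny classifier.
    if seed is not None:
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.use_deterministic_algorithms(True)  # fail loudly on non-deterministic ops
    X = torch.randn(256, 10)
    y = (X[:, 0] > 0).float()
    model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(50):
        opt.zero_grad()
        loss = loss_fn(model(X).squeeze(), y)
        loss.backward()
        opt.step()
    return loss.item()

print(train_once(), train_once())      # unseeded: two different losses
print(train_once(42), train_once(42))  # seeded: identical losses

On a GPU, even this is not enough: some CUDA kernels are non-deterministic by design, which is exactly the hardware point made above.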

And it doesn’t stop there. Versioning issues—like switching between different versions of a machine learning library—can lead to major differences in outcomes. Then there’s the problem of dataset availability and variability: many healthcare datasets are proprietary and locked away, making independent replication nearly impossible. Overfitting to specific training datasets is another major concern. When models rely too heavily on the same handful of datasets, interpreting their performance becomes murky. And let’s not forget selective reporting—only the best results get published, while the less flattering runs quietly disappear.
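
A low-tech countermeasure for the versioning and data-availability part, sketched here with hypothetical file names and an assumed package list (this is not a method from any of the studies discussed): write a small manifest next to every trained model that records the Python and library versions plus a fingerprint of the training data, so the run can at least be pinned down later.

import hashlib
import json
import platform
from importlib.metadata import version, PackageNotFoundError

def file_sha256(path):
    # Fingerprint of the exact training file that was used.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(train_file="training_data.csv", out="run_manifest.json"):
    packages = ["numpy", "pandas", "scikit-learn", "torch"]  # assumed dependency list
    manifest = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {},
        "training_data_sha256": file_sha256(train_file),
    }
    for p in packages:
        try:
            manifest["packages"][p] = version(p)
        except PackageNotFoundError:
            manifest["packages"][p] = "not installed"
    with open(out, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest

# write_manifest()  # drops run_manifest.json next to the model artefacts

It does nothing against overfitting or selective reporting, but it makes “which version, which data?” answerable after the fact.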

On top of all that, AI algorithms are usually black boxes. Their outputs are often anything but transparent. Machine learning methods are atheoretical, associative, and frequently opaque. That makes explainability a central challenge in AI. If users can’t understand how a model reached its decision, it becomes much harder to spot errors—or correct for bias.
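
For a concrete picture of what peeking into the black box can look like, here is a minimal, hypothetical sketch using post-hoc permutation importance on a toy classifier (scikit-learn; the feature names are invented). It shuffles one input at a time and measures how much test performance drops, a crude, model-agnostic hint at what the model is leaning on rather than a full explanation.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for clinical tabular data; the names are purely illustrative.
X, y = make_classification(n_samples=2000, n_features=6, n_informative=3, random_state=0)
feature_names = ["age", "creatinine", "crp", "heart_rate", "bmi", "smoker"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature in turn and record the drop in test accuracy.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for name, imp in sorted(zip(feature_names, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name:12s} {imp:+.3f}")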

At the same time, people have a hard time accepting predictions or recommendations when they can’t make sense of how they were reached. This becomes especially tricky thanks to the clash of two opposing psychological effects: algorithmic aversion—where people are skeptical of decisions made by algorithms—and automation bias, where people blindly trust automated systems. Some users question or outright reject AI-assisted decisions; others accept them without hesitation, relying less on their own judgment or on manual checks.

This dynamic makes the responsible use of AI even more complicated—especially when its outputs are opaque and hard to explain.

And let’s not forget: a substantial chunk of what we’ve considered “causal knowledge” in medicine has later turned out to be wrong. Treatments based on clearly flawed theories—but supported by randomized controlled trials—are still being used. And we tend to patch over the cognitive dissonance by coming up with a new, better-fitting theory. Boom—explainability restored.

That’s why insisting on full explainability for all AI decisions in medicine might be not only unnecessary, but actually counterproductive. What we really need is a case-by-case assessment for each clinical AI tool to determine how much explainability is actually required. Which, of course, doesn’t make things any easier.

One of the core issues with machine learning–based AI is data drift. This happens when the data evolves with the world around it, but the algorithm stays frozen in the time it was trained. And in medicine, that kind of change is constant. Clinical practices shift, population structures change, and so much more. In fact, AI can even become a victim of its own success—if it helps improve diagnostics or treatments, there’s a good chance its predictions and recommendations will become less accurate as a result.
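
What a basic drift check can look like, as a rough sketch (the shifted “lactate” values are invented for illustration; this is not Epic’s method): compare the distribution a model was trained on with the data it sees today, feature by feature, for example with a two-sample Kolmogorov–Smirnov test, and raise a flag when they diverge.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical lab value at training time vs. in production after a coding change
# or after new hospitals joined: same feature, shifted distribution.
train_lactate = rng.normal(loc=2.0, scale=0.8, size=5000)
live_lactate = rng.normal(loc=2.6, scale=1.1, size=5000)

def drifted(train_col, live_col, alpha=0.01):
    """Flag a feature whose live distribution no longer matches the training one."""
    stat, p = ks_2samp(train_col, live_col)
    return p < alpha, stat, p

flag, stat, p = drifted(train_lactate, live_lactate)
print(f"drift detected: {flag} (KS statistic {stat:.3f}, p = {p:.1e})")

A flagged feature should trigger monitoring, recalibration, or retraining, not silent continued use of the frozen model.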

And when that happens, it can cost lives. That’s exactly what went wrong with Epic’s Sepsis Prediction Model. It was used by one of the biggest healthcare corporations in the U.S. to assess the risk of patients developing sepsis and to guide treatment accordingly. For a while, it worked well. But then the ICD coding for sepsis changed, and the company acquired additional hospitals that hadn’t been part of the original training data—and suddenly, the model began to fail.

This is a textbook example of the generalization problem in AI-driven healthcare systems. IBM’s Watson for Oncology also worked reasonably well—at Memorial Sloan Kettering. But it didn’t translate to other hospitals. Optum’s healthcare AI system showed racial bias, rating Black patients as lower risk than they actually were. Google’s retinopathy detection model performed well in studies but failed in real-world clinics in Thailand. The AI used by the NHS and Babylon Health gave misleading or uncertain medical advice. COVID-19 prediction models fell apart in real-life conditions. Google’s breast cancer and Stanford’s pneumonia models looked great in testing, but flopped in the clinic. Google Flu Trends famously crashed and burned. Skin cancer detection models performed worse on darker skin. PathAI’s diagnostic AI gave inconsistent or outright wrong cancer diagnoses.

And that’s just a sample. The list goes on.

And at the heart of it all is the bias problem. Every AI model is vulnerable to it—but especially today’s massively hyped large language models (LLMs). They risk amplifying the biases already baked into scientific literature and research culture. This happens at every stage of the AI lifecycle: from data collection and annotation, to model development, deployment, and evaluation.

A lot of published scientific information is flawed, outdated, or biased. And when AI models are trained on that kind of data, they don’t just absorb those flaws—they replicate and amplify them. One of the biggest challenges is telling trustworthy sources apart from questionable ones, and distinguishing neutral study designs from biased ones. Even seasoned experts often struggle with that.
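
Whatever its source, bias of this kind only becomes visible if you look for it. Here is a small, entirely made-up example of the kind of subgroup check that does: pooled performance can look acceptable while sensitivity for one patient group quietly collapses.

import pandas as pd

# Synthetic toy data: 1 = sick / flagged, 0 = healthy / not flagged.
df = pd.DataFrame({
    "group":       ["A"] * 6 + ["B"] * 6,
    "has_disease": [1, 1, 1, 0, 0, 0,  1, 1, 1, 0, 0, 0],
    "ai_flagged":  [1, 1, 0, 0, 0, 1,  1, 0, 0, 0, 0, 0],
})

# Sensitivity (share of truly sick patients the model flags), broken down by group:
sick = df[df["has_disease"] == 1]
print(sick.groupby("group")["ai_flagged"].mean())
# Group A: 0.67, group B: 0.33 -- a gap that aggregate accuracy alone would hide.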

But wait—don’t we already have a solution to all this? Can’t we just make AI “trustworthy”? Wasn’t there a high-level expert group from the European Commission that published ethical guidelines for trustworthy AI back in 2019? And didn’t the EU’s brand new Artificial Intelligence Act (all 144 pages of it!) build on that? So isn’t trustworthy AI already enshrined in law, at least in the EU? There are also recent positions from the German Ethics Council, the German Medical Association, and others—all pointing in the same direction.

There’s no shortage of ethical guidelines for AI research and development—but there’s a massive gap between high-level principles and what actually happens on the ground. Meta-studies have identified nearly 100 such frameworks, yet “unethical” AI applications keep popping up. Most of these so-called trustworthy AI guidelines are far too abstract to be useful—over 75% consist of vague principles, and more than 80% offer little to no practical advice for researchers or developers. They’re great at telling you what should be done, but not how to do it.

In fact, there’s now an entire market for Trustworthy AI. Developers and companies can go “ethics shopping” to find a guideline vague enough to fit their product—or they just write their own. That’s what we call “ethics washing” or, more cynically, “regulatory capture.” Microsoft is a good example: they’ve launched four initiatives to shape AI guidelines in healthcare under the banner of the “Trustworthy and Responsible AI Network (TRAIN).” They’re providing experts, technical resources, and funding. Which, of course, also lets them help define the testing standards and regulations—ensuring their own technology gets preferential treatment and raising the barrier for competitors.

What often gets overlooked in all of this is that most so-called “AI research” is really more engineering than actual science. In healthcare, the focus of current AI work is mostly on exploring solutions and building applications, along with feasibility studies—proofs of concept (PoCs)—that haven’t been properly validated in the real world. And there’s no shortage of these. In fact, they’re multiplying. And more often than not, they’re sold to us as something they’re not—as mature, “transformative” tools ready for deployment.

But truly validated AI tools—validated through high-quality randomized controlled trials? You could count those on two hands.

This distinction between exploration and PoCs on the one hand, and confirmation and validation on the other, is almost completely ignored—both in how studies are evaluated and in discussions about what makes AI trustworthy. But it’s crucial. We should have different levels of trust for different phases of development. The requirements for training data quality and quantity, explainability, transparency, bias mitigation—all of that—should depend on whether we’re talking about a prototype or a tool being rolled out in clinical practice.

And that, in turn, should be the key question: is this AI ready to be used on real patients—or not?

The study mentioned at the beginning—the one supposedly proving that ChatGPT outperforms doctors in making diagnoses—isn’t just a prime example of media-driven AI hype. It’s also a case study in how poor the science in this space can be. The sample size? Six. Yes, six. And even then, they used the wrong statistical method. The late, great British statistician Douglas Altman once put it bluntly: “n=8 is a dinner party, not a study.”
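
A back-of-the-envelope power calculation makes the point (purely illustrative; the trial itself used a different design and outcome): treating 92% versus 74% accuracy as two independent proportions, you would need roughly 30 cases per arm for 80% power at alpha = 0.05, several times more than six.

from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

effect = proportion_effectsize(0.92, 0.74)  # Cohen's h for the two accuracies
n_per_arm = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.80)
print(f"Cohen's h = {effect:.2f}, required n per arm ~ {n_per_arm:.0f}")
# Comes out around 32 per arm -- with six, only enormous effects are detectable.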

And it wasn’t even about making correct diagnoses—it was about “diagnostic reasoning,” measured with an arbitrary, unvalidated scale. That’s just one of many issues. Honestly, this paper should never have been published in The Lancet Digital Health. So much for the widely overestimated gatekeeping power of peer review—something I have criticized before (see LJ 10/2020 – in German).

But it can be done right. A newly published, well-designed, single-blinded randomized controlled trial studied the accuracy of mammography screening in over 105,000 women. The results suggest that AI can help detect clinically relevant breast cancer earlier while reducing the workload—without increasing the number of false positives.

Of course, AI has the potential to improve medical diagnostics, clinical decision-making, and prognostics, to speed up drug development, and to push forward wearable health tech—just to name the most often-cited use cases. But right now, the development of AI tools that are effective, safe, trustworthy, fair, and sustainable is being held back. Why? A “move fast, break things” mentality from commercial developers, and intense lobbying for industry-friendly regulation. Meanwhile, the science—blinded by the hype—shows serious weaknesses: poor transparency, shoddy reporting, weak bias mitigation, and a lack of proper validation.

Bottom line? We need evidence-based AI.

I thank Vince Madai for the thoughtful critiques and engaging discussions.

References and further reading:

Antes, G. Big Data und Personalisierte Medizin: Goldene Zukunft oder leere Versprechungen? – Deutsches Ärzteblatt. Retrieved March 11, 2025, from https://www.aerzteblatt.de/archiv/big-data-und-personalisierte-medizin-goldene-zukunft-oder-leere-versprechungen-d80c87d9-62e0-48d0-92df-ca78917d7254

Antes, G. Eine neue Wissenschaft-(lichkeit)? Retrieved March 11, 2025, from https://www.laborjournal.de/editorials/981.php

Antes, G., & Häussler, B. „Big Data: zwischen Big Chance und Big Error“ – Monitor Versorgungsforschung. Retrieved March 11, 2025, from https://www.monitor-versorgungsforschung.de/abstract/big-data-zwischen-big-chance-und-big-error/?cookie-state-change=1741704377072

Bélisle-Pipon, J. C., Monteferrante, E., Roy, M. C., & Couture, V. (2023). Artificial intelligence ethics has a black box problem. AI and Society, 38(4), 1507–1522. https://doi.org/10.1007/S00146-021-01380-0

Benjamens, S., Dhunnoo, P., & Meskó, B. (2020). The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database. Npj Digital Medicine, 3(1). https://doi.org/10.1038/S41746-020-00324-0

Bezuidenhout, L., & Ratti, E. (2021). What does it mean to embed ethics in data science? An integrative approach based on microethics and virtues. AI and Society, 36(3), 939–953. https://doi.org/10.1007/S00146-020-01112-W

Bürger, V. K., Amann, J., Bui, C. K. T., Fehr, J., & Madai, V. I. (2024). The unmet promise of trustworthy AI in healthcare: why we fail at clinical translation. Frontiers in Digital Health, 6, 1279629. https://doi.org/10.3389/FDGTH.2024.1279629/BIBTEX

Canca, C. (2020). Operationalizing AI ethics principles. Commun. ACM, 63(12), 18–21. https://doi.org/10.1145/3430368

Chapman University. Bias in AI. Retrieved February 28, 2025, from https://www.chapman.edu/ai/bias-in-ai.aspx

Collins, G. S., Moons, K. G. M., Dhiman, P., Riley, R. D., Beam, A. L., Van Calster, B., Ghassemi, M., Liu, X., Reitsma, J. B., Van Smeden, M., Boulesteix, A. L., Camaradou, J. C., Celi, L. A., Denaxas, S., Denniston, A. K., Glocker, B., Golub, R. M., Harvey, H., Heinze, G., … Logullo, P. (2024). TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ, 385. https://doi.org/10.1136/BMJ-2023-078378

Davenport, T., & Kalakota, R. (2019). The potential for artificial intelligence in healthcare. Future Healthcare Journal, 6(2), 94–98. https://doi.org/10.7861/FUTUREHOSP.6-2-94

Denniston, A. K., & Liu, X. (2024). Responsible and evidence-based AI: 5 years on. The Lancet Digital Health, 6(5), e305–e307. https://doi.org/10.1016/S2589-7500(24)00071-2

Dubovitskaya, E. XAI in AI Act: Interview on Transparency & Corporate Obligations. Retrieved March 5, 2025, from https://www.statworx.com/en/content-hub/interview/the-role-of-explainable-ai-xai-in-eu-law-compliance/

Finlayson, S. G., Subbaswamy, A., Singh, K., Bowers, J., Kupke, A., Zittrain, J., Kohane, I. S., & Saria, S. (2021). The Clinician and Dataset Shift in Artificial Intelligence. New England Journal of Medicine, 385(3), 283–286. https://doi.org/10.1056/NEJMC2104626/SUPPL_FILE/NEJMC2104626_DISCLOSURES.PDF

Floridi, L. (2019). Translating principles into practices of digital ethics: five risks of being unethical. Philos. Technol., 32(2), 185–193. https://doi.org/10.1007/s13347-019-00354-x

Floridi, L., & Cowls, J. (2019). A unified framework of five principles for AI in society. Harv. Data Sci. Rev. https://doi.org/10.1162/99608f92.8cd550d1

Gichoya, J. W., Thomas, K., Celi, L. A., Safdar, N., Banerjee, I., Banja, J. D., Seyyed-Kalantari, L., Trivedi, H., & Purkayastha, S. (2023). AI pitfalls and what not to do: mitigating bias in AI. British Journal of Radiology, 96(1150). https://doi.org/10.1259/BJR.20230023/7498925

Gilbert, S., Dai, T. & Mathias, R. Consternation as Congress proposal for autonomous prescribing AI coincides with the haphazard cuts at the FDA. npj Digit. Med. 8, 165 (2025). https://doi.org/10.1038/s41746-025-01540-2

Gille, F., Jobin, A., & Ienca, M. (2020). What we talk about when we talk about trust: Theory of trust for AI in healthcare. Intelligence-Based Medicine, 12. https://doi.org/10.1016/J.IBMED.2020.100001

Goirand, M., Austin, E., & Clay-Williams, R. (2021). Implementing Ethics in Healthcare AI-Based Applications: A Scoping Review. Science and Engineering Ethics, 27(5). https://doi.org/10.1007/S11948-021-00336-3

Gundersen, O. E., & Kjensmo, S. (2018). State of the art: Reproducibility in artificial intelligence. 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, 1644–1651. https://doi.org/10.1609/AAAI.V32I1.11503

Hagendorff, T. (2020). The ethics of AI ethics: an evaluation of guidelines. Mind. Mach., 30(1), 99–120. https://doi.org/10.1007/s11023-020-09517-8

Han, R., Acosta, J. N., Shakeri, Z., Ioannidis, J. P. A., Topol, E. J., & Rajpurkar, P. (2024). Randomised controlled trials evaluating artificial intelligence in clinical practice: a scoping review. The Lancet Digital Health, 6(5), e367–e373. https://doi.org/10.1016/S2589-7500(24)00047-5

Hernström, V., Josefsson, V., Sartor, H., Schmidt, D., Larsson, A. M., Hofvind, S., Andersson, I., Rosso, A., Hagberg, O., & Lång, K. (2025). Screening performance and characteristics of breast cancer detected in the Mammography Screening with Artificial Intelligence trial (MASAI): a randomised, controlled, parallel-group, non-inferiority, single-blinded, screening accuracy study. The Lancet Digital Health, 7(3), e175–e183. https://doi.org/10.1016/S2589-7500(24)00267-X

Hickok, M. (2021). Lessons learned from AI ethics principles for future actions. AI Ethics, 1(1), 41–47. https://doi.org/10.1007/s43681-020-00008-1

Higgins, D. C. (2021). OnRAMP for Regulating Artificial Intelligence in Medical Products. Advanced Intelligent Systems, 3(11), 2100042. https://doi.org/10.1002/AISY.202100042

Jobin, A., Ienca, M., & Vayena, E. (2019). The global landscape of AI ethics guidelines. Nature Machine Intelligence, 1(9), 389–399. https://doi.org/10.1038/S42256-019-0088-2

Kazim, E., & Koshiyama, A. S. (2021). A high-level overview of AI ethics. Patterns, 2(9). https://doi.org/10.1016/J.PATTER.2021.100314

Kolata, G. ChatGPT Defeated Doctors at Diagnosing Illness – The New York Times. Retrieved March 5, 2025, from https://www.nytimes.com/2024/11/17/health/chatgpt-ai-doctors-diagnosis.html

Kolbinger, F. R., Veldhuizen, G. P., Zhu, J., Truhn, D., & Kather, J. N. (2024). Reporting guidelines in medical artificial intelligence: a systematic review and meta-analysis. Communications Medicine, 4(1), 1–10. https://doi.org/10.1038/s43856-024-00492-0

Lin, C. S., Liu, W. T., Tsai, D. J., Lou, Y. S., Chang, C. H., Lee, C. C., Fang, W. H., Wang, C. C., Chen, Y. Y., Lin, W. S., Cheng, C. C., Lee, C. C., Wang, C. H., Tsai, C. S., Lin, S. H., & Lin, C. (2024). AI-enabled electrocardiography alert intervention and all-cause mortality: a pragmatic randomized clinical trial. Nature Medicine, 30(5), 1461–1470. https://doi.org/10.1038/S41591-024-02961-4

Liu, X., Rivera, S. C., Moher, D., Calvert, M., Denniston, A., & SPIRIT-AI and CONSORT-AI Working Group. (2020). Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. The Lancet Digital Health, 2, e537–e548.

Mittelstadt, B. (2019). Principles alone cannot guarantee ethical AI. Nature Machine Intelligence, 1(11), 501–507. https://doi.org/10.1038/S42256-019-0114-4

Morley, J., Floridi, L., Kinsey, L., & Elhalal, A. (2021). From What to How: An Initial Review of Publicly Available AI Ethics Tools, Methods and Research to Translate Principles into Practices. Philosophical Studies Series, 144, 153–183. https://doi.org/10.1007/978-3-030-81907-1_10

Morley, J., Kinsey, L., Elhalal, A., Garcia, F., Ziosi, M., & Floridi, L. (2023). Operationalising AI ethics: barriers, enablers and next steps. AI and Society, 38(1), 411–423. https://doi.org/10.1007/S00146-021-01308-8

Munn, L. (2022). The uselessness of AI ethics. AI and Ethics, 3(3), 869–877. https://doi.org/10.1007/S43681-022-00209-W

Polevikov, S. The “AI Outperforms Doctors” Claim Is False, Despite NYT Story – A Rebuttal. Retrieved March 5, 2025, from https://sergeiai.substack.com/p/the-ai-outperforms-doctors-claim

Prem, E. (2023). From ethical AI frameworks to tools: a review of approaches. AI and Ethics, 3(3), 699–716. https://doi.org/10.1007/S43681-023-00258-9

Radiology Health Register. Retrieved February 27, 2025, from https://radiology.healthairegister.com

Rajpurkar, P., & Lungren, M. P. (2023). The Current and Future State of AI Interpretation of Medical Images. New England Journal of Medicine, 388(21), 1981–1990. https://doi.org/10.1056/NEJMRA2301725

Rajpurkar, P., Chen, E., Banerjee, O., & Topol, E. J. (2022). AI in health and medicine. Nature Medicine, 28(1), 31–38. https://doi.org/10.1038/S41591-021-01614-0

Rockenschaub, P., Akay, E. M., Carlisle, B. G., Hilbert, A., Wendland, J., Meyer-Eschenbach, F., Näher, A. F., Frey, D., & Madai, V. I. (2025). External validation of AI-based scoring systems in the ICU: a systematic review and meta-analysis. BMC Medical Informatics and Decision Making, 25(1), 1–10. https://doi.org/10.1186/s12911-024-02830-7    

Roose, K. Why I’m Feeling the A.G.I. – The New York Times.  Retrieved March 14, 2025, from https://www.nytimes.com/2025/03/14/technology/why-im-feeling-the-agi.html    

Shneiderman, B. (2020). Bridging the gap between ethics and practice: guidelines for reliable, safe, and trustworthy human-centered AI systems. ACM Trans. Interact. Intell. Syst., 10(4), 1–31. https://doi.org/10.1145/3419764

Stucki, G., Rubinelli, S., & Bickenbach, J. (2020). We need an operationalisation, not a definition of health. Disability and Rehabilitation, 42(3), 442–444. https://doi.org/10.1080/09638288.2018.1503730

Vakkuri, V., Kemell, K. K., Jantunen, M., & Abrahamsson, P. (2020). “This is Just a Prototype”: How Ethics Are Ignored in Software Startup-Like Environments. Lecture Notes in Business Information Processing, 383 LNBIP, 195–210. https://doi.org/10.1007/978-3-030-49392-9_13

van Leeuwen, K. G., Schalekamp, S., Rutten, M. J. C. M., van Ginneken, B., & de Rooij, M. (2021). Artificial intelligence in radiology: 100 commercially available products and their scientific evidence. European Radiology, 31(6), 3797–3804. https://doi.org/10.1007/S00330-021-07892-Z

Vasey, B., Nagendran, M., Campbell, B., Clifton, D. A., Collins, G. S., Denaxas, S., Denniston, A. K., Faes, L., Geerts, B., Ibrahim, M., Liu, X., Mateen, B. A., Mathur, P., McCradden, M. D., Morgan, L., Ordish, J., Rogers, C., Saria, S., Ting, D. S. W., … Perkins, Z. B. (2022). Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nature Medicine, 28(5), 924–933. https://doi.org/10.1038/S41591-022-01772-9

WHO. (2021). Generating evidence for artificial intelligence-based medical devices: a framework for training, validation and evaluation. https://www.who.int/publications/i/item/9789240038462

Wong, A., Otles, E., Donnelly, J. P., Krumm, A., McCullough, J., DeTroyer-Cooley, O., Pestrue, J., Phillips, M., Konye, J., Penoza, C., Ghous, M., & Singh, K. (2021). External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients. JAMA Internal Medicine, 181(8), 1065–1070. https://doi.org/10.1001/JAMAINTERNMED.2021.2626

Wu, E., Wu, K., Daneshjou, R., Ouyang, D., Ho, D. E., & Zou, J. (2021). How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals. Nature Medicine, 27(4), 582–584. https://doi.org/10.1038/S41591-021-01312-X

Zicari, R. V., Brodersen, J., Brusseau, J., Dudder, B., Eichhorn, T., Ivanov, T., Kararigas, G., Kringen, P., McCullough, M., Moslein, F., Mushtaq, N., Roig, G., Sturtz, N., Tolle, K., Tithi, J. J., van Halem, I., & Westerlund, M. (2021). Z-Inspection®: A Process to Assess Trustworthy AI. IEEE Transactions on Technology and Society, 2(2), 83–97. https://doi.org/10.1109/TTS.2021.3066209

One comment

  1. Brooke Morriswood

    Great posting, Uli, always a pleasure (and now I’ve learned about algorithmic aversion and automation bias as well!). I think you’re absolutely right to highlight the overhyping of the technology as the main concern – there is actually nothing wrong with incremental gains, and even a small time benefit for physicians can translate into a large improvement overall.

    What’s curious about the AI hype bandwagon is the way its proponents (who presumably haven’t used it for the applications they describe) are so keen to trumpet how it will soon replace the need for human input, while anyone who’s used LLMs for anything serious can immediately tell that human input is absolutely critical. Perhaps the critical thing to remember is how much time it would take for a skilled human to do a particular task – if the addition of AI can help that skilled human do the same task in less time, then we’re onto a good thing for that particular application.
