We use tools that rely on artificial intelligence (AI) every day, with voice assistants like Alexa and Siri being among the most common. These consumer products work reasonably well—Siri understands most of what we say—but they are by no means perfect. We accept their limitations and adapt how we use them until they get the right answer, or we give up. After all, the consequences of Siri or Alexa misunderstanding a user request are usually minor.

However, mistakes by AI models that support doctors’ clinical decisions can mean life or death. Therefore, it’s critical that we understand how well these models work before deploying them. Published reports of this technology currently paint an overly optimistic picture of its accuracy, which at times translates into sensationalized stories in the press. Media are rife with discussions of algorithms that can diagnose early Alzheimer’s disease with up to 74 percent accuracy or that are more accurate than clinicians. The scientific papers detailing such advances may become foundations for new companies, new investments and lines of research, and large-scale implementations in hospital systems. In most cases, the technology is not ready for deployment.

Here’s why: As researchers feed more data into AI models, the models are expected to become more accurate, or at least not get worse. However, our work and the work of others have identified the opposite trend: the reported accuracy of published models decreases as data set size increases.

The cause of this counterintuitive pattern lies in how scientists estimate and report a model’s accuracy. Under best practices, researchers train their AI model on a portion of their data set, holding the rest in a “lockbox.” They then use that “held-out” data to test their model for accuracy. For example, say an AI program is being developed to distinguish people with dementia from people without it by analyzing how they speak. The model is developed using training data that consist of spoken language samples and dementia diagnosis labels, to predict whether a person has dementia from their speech. It is then tested against held-out data of the same type to estimate how accurately it will perform. That estimate of accuracy then gets reported in academic publications; the higher the accuracy on the held-out data, the better the scientists say the algorithm performs.
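To make the lockbox idea concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than a real clinical pipeline: the data are synthetic, each "speech sample" is reduced to a single made-up number, and the "model" is just a cutoff halfway between the two groups' averages. The point is only the order of operations: split first, fit on the training portion, and open the lockbox once at the end.

```python
import random

def make_data(n, rng):
    """Synthetic stand-in for speech data: one numeric feature per person.
    Label 1 = dementia, label 0 = no dementia (hypothetical, not real data)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        feature = rng.gauss(1.0 if label == 1 else 0.0, 1.0)
        data.append((feature, label))
    return data

def fit_threshold(train):
    """Toy 'model': a cutoff halfway between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, data):
    """Fraction of people whose predicted label matches the true label."""
    return sum((x > threshold) == (y == 1) for x, y in data) / len(data)

rng = random.Random(0)
data = make_data(1000, rng)
rng.shuffle(data)

# Split once, up front: 80 percent for training, 20 percent in the lockbox.
split = int(0.8 * len(data))
train, lockbox = data[:split], data[split:]

# The model is fit using the training portion only.
threshold = fit_threshold(train)

# The lockbox is opened a single time, after the model is fixed.
heldout_accuracy = accuracy(threshold, lockbox)
print(f"Held-out accuracy: {heldout_accuracy:.2f}")
```

The held-out number is only an honest estimate if nothing about the lockbox influenced the choice of threshold.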

And why does the research say that reported accuracy decreases with increasing data set size? Ideally, the held-out data are never seen by the scientists until the model is completed and fixed. However, scientists may peek at the data, sometimes unintentionally, and modify the model until it yields a high accuracy, a phenomenon known as data leakage. By using the held-out data to modify their model and then to test it, the researchers are virtually guaranteeing the system will correctly predict the held-out data, leading to inflated estimates of the model’s true accuracy. Instead, they need to use new data sets for testing, to see if the model is actually learning and can look at something fairly unfamiliar to come up with the right diagnosis.
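A small simulation shows how strong this effect can be. The sketch below is a deliberately extreme, hypothetical setup: the labels are coin flips with no relationship to the features, so no model can truly do better than 50 percent. We then try many candidate models and keep whichever scores best on the held-out set, which is a simplified stand-in for repeatedly peeking and tweaking. All names and sizes are assumptions chosen for illustration.

```python
import random

N_FEATURES = 20

def make_noise_data(n, rng):
    """Features and labels are independent, so the true accuracy of any model is 50 percent."""
    return [([rng.random() for _ in range(N_FEATURES)], rng.randint(0, 1))
            for _ in range(n)]

def accuracy(model, data):
    """A model here is just (feature index, cutoff): predict 1 if the feature exceeds the cutoff."""
    feature_index, cutoff = model
    hits = sum((x[feature_index] > cutoff) == (y == 1) for x, y in data)
    return hits / len(data)

rng = random.Random(0)
heldout = make_noise_data(40, rng)    # small held-out set that gets peeked at
fresh = make_noise_data(5000, rng)    # new data the model has never touched

# "Peeking": try 500 candidate models and keep the one that looks best on the held-out set.
candidates = [(rng.randrange(N_FEATURES), rng.random()) for _ in range(500)]
best = max(candidates, key=lambda m: accuracy(m, heldout))

leaky_accuracy = accuracy(best, heldout)   # inflated: the model was chosen using this data
honest_accuracy = accuracy(best, fresh)    # close to chance on genuinely new data

print(f"Accuracy on peeked held-out data: {leaky_accuracy:.2f}")
print(f"Accuracy on fresh data:           {honest_accuracy:.2f}")
```

The gap between the two numbers comes entirely from selecting the model on the same data used to score it; nothing was learned.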

While these overoptimistic estimates of accuracy get published in the scientific literature, the lower-performing models are stuffed in the proverbial “file drawer,” never to be seen by other researchers; or, if they are submitted for publication, they are less likely to be accepted. The impacts of data leakage and publication bias are especially large for models trained and evaluated on small data sets. That is, models trained with small data sets are more likely to report inflated estimates of accuracy. As a result, the published literature shows a peculiar trend in which models trained on small data sets report higher accuracy than models trained on large data sets.
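The file-drawer effect alone can produce this trend, as the following sketch suggests. It assumes, purely for illustration, that every study evaluates a model whose true accuracy is 70 percent and that only the top-scoring fifth of studies gets published. The study counts, test set sizes and publication rule are all made-up parameters, not estimates from the literature.

```python
import random

TRUE_ACCURACY = 0.70  # assumed true accuracy of every model in this toy world

def observed_accuracy(n_test, rng):
    """Accuracy measured on a test set of n_test cases; smaller sets give noisier estimates."""
    correct = sum(rng.random() < TRUE_ACCURACY for _ in range(n_test))
    return correct / n_test

def mean_published_accuracy(n_test, n_studies, rng, publish_fraction=0.2):
    """Run many studies, 'publish' only the best-scoring fraction, and average what gets published."""
    results = sorted((observed_accuracy(n_test, rng) for _ in range(n_studies)),
                     reverse=True)
    published = results[:max(1, int(publish_fraction * n_studies))]
    return sum(published) / len(published)

rng = random.Random(0)
published_by_size = {n: mean_published_accuracy(n, 500, rng)
                     for n in (20, 100, 1000)}

for n, acc in published_by_size.items():
    print(f"test set size {n:>4}: mean published accuracy {acc:.2f}")
```

Because small test sets produce noisier estimates, the luckiest small studies land furthest above the true value, so the published average falls as test sets grow even though the underlying model never changes.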

We can prevent these issues by being more rigorous about how we validate models and how results are reported in the literature. After determining that development of an AI model is ethical for a particular application, the first question an algorithm designer should ask is “Do we have enough data to model a complex construct like human health?” If the answer is yes, then scientists should spend more time on reliable evaluation of models and less time trying to squeeze every ounce of “accuracy” out of a model. Reliable validation of models begins with ensuring we have representative data. The most challenging problem in AI model development is the design of the training and test data themselves. While consumer AI companies opportunistically harvest data, clinical AI models require more care because of the high stakes. Algorithm designers should routinely question the size and composition of the data used to train a model to make sure they are representative of the range of a condition’s presentation and of users’ demographics. All data sets are imperfect in some ways. Researchers should aim to understand the limitations of the data used to train and evaluate models and the implications of these limitations for model performance.

Unfortunately, there is no silver bullet for reliably validating clinical AI models. Every tool and every clinical population is different. To get to satisfactory validation plans that take into account real-world conditions, clinicians and patients need to be involved early in the design process, with input from stakeholders like the Food and Drug Administration. A broader conversation is more likely to ensure that the training data sets are representative; that the parameters for knowing the model works are relevant; and that what the AI tells a clinician is appropriate. There are lessons to be learned from the reproducibility crisis in clinical research, where strategies like pre-registration and patient centeredness in research were proposed as a means of increasing transparency and fostering trust. Similarly, a sociotechnical approach to AI model design recognizes that building trustworthy and responsible AI models for clinical applications is not strictly a technical problem. It requires deep knowledge of the underlying clinical application area, a recognition that these models exist in the context of larger systems, and an understanding of the potential harms if the model performance degrades when deployed.

Without this holistic approach, AI hype will continue. That is unfortunate, because the technology has real potential to improve clinical outcomes and extend clinical reach into underserved communities. Adopting a more holistic approach to developing and testing clinical AI models will lead to more nuanced discussions about how well these models can work and their limitations. We think this will ultimately result in the technology reaching its full potential and people benefitting from it.

The authors thank Gautam Dasarathy, Pouria Saidi and Shira Hahn for enlightening conversations on this topic. They helped elucidate some of the points discussed in the article.

This is an opinion and analysis article, and the views expressed by the author or authors are not necessarily those of Scientific American.