A Country Doctor Reads: Algorithms May Not Work on Diverse Populations in All Cultures - AI Enthusiasts, Take Notice
Catching up on my journals tonight, I found this January article in the New England Journal of Medicine very interesting. Of course, my bias is that, just as with self-driving cars, we shouldn't be too quick to assume computers are always better than humans at judging complex real-life situations.
Lea et al. wrote about a hospital in Leeds, England, where researchers in the 1960s built an elaborate computer algorithm, AAPHelp, to evaluate emergency room patients with abdominal pain.
When the … algorithm was tested on roughly 300 patients who presented to the General Infirmary in 1971, the program dazzled. According to the team’s 1972 British Medical Journal report, AAPHelp generated the correct diagnosis in 91.8% of cases, surpassing the performance of senior clinicians. Buoyed by AAPHelp’s impressive performance, de Dombal introduced it to hospitals outside Leeds. But when his group teamed up with researchers at Bispebjerg Hospital in Copenhagen in 1976 to test the system in a fresh clinical environment, its overall accuracy plummeted to 65%.
The problem wasn’t the system’s hardware or software. Instead, it was its data. The population used to develop AAPHelp differed in critical ways from the population in which it was subsequently implemented. First, there was a clinical and epidemiologic mismatch: presentations of the most common causes of abdominal pain varied between the locations. For instance, the clinical spectrum of pancreatitis differed, possibly owing to differing patterns of alcohol use. A taxonomic mismatch compounded this issue: the hospitals classified “acute abdomen” differently. Because of peculiarities of the ways in which patients moved through each hospital, the Leeds dataset excluded patients with salpingitis or urolithiasis, whereas the Copenhagen population included them. Subtler cultural and linguistic differences were also at play, including variations in the ways people described their pain. These incongruities meant that the conditional probabilities underlying AAPHelp were inaccurate for patients in Copenhagen.
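For anyone building clinical AI today, the mechanism is worth pausing on. AAPHelp boiled down to conditional probabilities: how likely each finding was given each diagnosis, estimated from Leeds patients. The little Python sketch below is my own illustration, not de Dombal's actual tables; the diagnoses, symptoms, and every number in it are invented. It shows how a classifier of that general kind can look excellent on the population whose data shaped it and stumble on a population where the same symptoms carry different probabilities, and where diagnoses it never saw (here, urolithiasis) start showing up.

```python
import math
import random

# Hypothetical illustration only -- invented numbers, not de Dombal's actual tables.

# "Development" population: priors and P(symptom present | diagnosis)
# rlq_tenderness = right-lower-quadrant tenderness
LEEDS_PRIORS = {"appendicitis": 0.35, "non-specific pain": 0.65}
LEEDS_LIKELIHOODS = {
    "appendicitis":      {"rlq_tenderness": 0.85, "nausea": 0.70, "fever": 0.55},
    "non-specific pain": {"rlq_tenderness": 0.20, "nausea": 0.35, "fever": 0.10},
}

# "Deployment" population: same labels but shifted probabilities, plus a
# diagnosis (urolithiasis) that the development data never contained.
CPH_PRIORS = {"appendicitis": 0.25, "non-specific pain": 0.55, "urolithiasis": 0.20}
CPH_LIKELIHOODS = {
    "appendicitis":      {"rlq_tenderness": 0.70, "nausea": 0.50, "fever": 0.35},
    "non-specific pain": {"rlq_tenderness": 0.35, "nausea": 0.45, "fever": 0.15},
    "urolithiasis":      {"rlq_tenderness": 0.30, "nausea": 0.60, "fever": 0.10},
}

SYMPTOMS = ["rlq_tenderness", "nausea", "fever"]


def simulate(priors, likelihoods, n, rng):
    """Draw n patients: pick a true diagnosis, then flip each symptom independently."""
    patients = []
    for _ in range(n):
        dx = rng.choices(list(priors), weights=list(priors.values()))[0]
        findings = {s: rng.random() < likelihoods[dx][s] for s in SYMPTOMS}
        patients.append((dx, findings))
    return patients


def naive_bayes_diagnose(findings, priors, likelihoods):
    """Score only the diagnoses the classifier was built on; return the most probable."""
    best_dx, best_score = None, float("-inf")
    for dx, prior in priors.items():
        score = math.log(prior)
        for s in SYMPTOMS:
            p = likelihoods[dx][s]
            score += math.log(p if findings[s] else 1 - p)
        if score > best_score:
            best_dx, best_score = dx, score
    return best_dx


def accuracy(patients):
    """Fraction of patients the development-population classifier gets right."""
    correct = sum(
        naive_bayes_diagnose(findings, LEEDS_PRIORS, LEEDS_LIKELIHOODS) == dx
        for dx, findings in patients
    )
    return correct / len(patients)


rng = random.Random(1972)
development_patients = simulate(LEEDS_PRIORS, LEEDS_LIKELIHOODS, 5000, rng)
deployment_patients = simulate(CPH_PRIORS, CPH_LIKELIHOODS, 5000, rng)

print(f"Accuracy on the population it was built from: {accuracy(development_patients):.0%}")
print(f"Accuracy on the new population:               {accuracy(deployment_patients):.0%}")
```

The point is not the exact numbers but the shape of the failure: nothing in the code is "broken," yet its answers stop matching the patients in front of it, exactly as the authors describe.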
The authors conclude:
The question of whether results derived from the study of some patients can be applied to others has long vexed physicians. Authors of the Hippocratic Corpus emphasized personalizing therapeutic regimens. U.S. physicians in the 19th century believed the nature of disease and the efficacy of medical interventions differed significantly between northern and southern states, and even more so between Black people and White people.
The development in the late 19th century of laboratory-based medical science, from histopathology to bacteriology, and the emergence of germ theory took medicine in a new direction. Doctors increasingly saw disease categories and therapeutics as universal.
And here we are, on the one hand going back to personalized medicine, and on the other hand assuming that algorithms can become sophisticated enough to deliver accurate diagnoses across populations and across cultures...
When will the AI world see that, to its models, whatever isn't expressed in some scrapable data form simply doesn't exist? And that the spewers of such manipulated data don't even try to evaluate the bias of what they scraped? I understand data has its place, but the wholesale blindness to the sources of AI information is appalling.