
The risks of OpenAI’s Whisper audio transcription model

This weekend, a story from ABC News on issues with machine learning models for audio transcription did the rounds.

“Researchers say an AI-powered transcription tool used in hospitals invents things no one ever said”

But Whisper has a major flaw: It is prone to making up chunks of text or even entire sentences, according to interviews with more than a dozen software engineers, developers and academic researchers. Those experts said some of the invented text — known in the industry as hallucinations — can include racial commentary, violent rhetoric and even imagined medical treatments.

It took me a couple of days to find the time to properly dig into it, and it’s a mixed bag. The report highlights a number of very serious and real issues, but in the process glosses over a few details that might be important.

It mixes anecdotal evidence with studies that use varying versions of OpenAI’s Whisper, wrapped in a range of different software and run on a variety of audio types.

Audio quality, length, number of voices, and speech patterns all matter a lot, and the longer the recording, the more likely it is to contain errors.

But there are two core conclusions to take away from it.

The Nabla service itself seems flawed

First, the Nabla service, an audio transcription and summarisation service that targets medical professionals, is specifically using Whisper in a context that OpenAI itself recommends against.

Nabla seems to have what at first glance look like several major design flaws, the biggest being the automatic deletion of the original audio, which makes verification impossible.

Others have dug into the Nabla service itself and discovered a number of issues.

To summarise:

  • It transcribes audio using a service that is specifically not intended for a medical context.
  • It then summarises the transcript using an LLM to create a statement that’s supposed to go into the patient’s Electronic Medical Record.
  • It then deletes the original recordings.
  • The storage of the transcripts and summaries seems iffy in terms of both privacy and regulatory compliance.
  • The privacy safeguards, as documented, seem a bit contradictory.

There’s plenty of reason to be sceptical of the service, even without getting into the issues with OpenAI’s Whisper.

But…

OpenAI’s Whisper model also seems flawed

The second core observation comes from the one study the report cites that isn’t just anecdata (PDF), which seems to show a 1-2% hallucination rate, depending on the type of speech.

In the study, each audio segment roughly represents a sentence. According to the study’s results, that would mean about 1 or 2 of every 100 transcribed sentences contain a fabrication.

This explains why individual users won’t notice the errors in normal use. A 1% error rate is very easy to miss, especially because these models tend towards plausible fabrications, but at scale it could be catastrophic, depending on the industry in question.

What’s worrying is that the analysis seems to show that a good chunk of the fabrications, roughly 40%, are outright harmful. The categories the study uses are:

  • “Perpetuations of Violence”. Portrayals or implications of violence.
  • “Inaccurate Associations”. Made-up names, relationships, locations, or health statuses.
  • “False Authority”. Hallucinations that misrepresent the speaker source.

That basically means that 1 out of roughly every 200 sentences transcribed contains a harmful fabrication of some sort.
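
As a quick sanity check (my own back-of-the-envelope arithmetic, not a figure from the study), combining the 1-2% hallucination rate with the roughly 40% harmful fraction lands somewhere between 1 in 250 and 1 in 125 segments, so “roughly 1 in 200” is in the right ballpark:

```python
# Back-of-the-envelope check, not a figure from the study itself:
# per-segment hallucination rate multiplied by the ~40% harmful fraction.
for rate in (0.01, 0.02):
    harmful = rate * 0.4
    print(f"{rate:.0%} hallucinations -> {harmful:.2%} harmful, "
          f"roughly 1 in {round(1 / harmful)} segments")

# 1% hallucinations -> 0.40% harmful, roughly 1 in 250 segments
# 2% hallucinations -> 0.80% harmful, roughly 1 in 125 segments
```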

Together, this would mean that if this tech were rolled out widely in sensitive industries such as healthcare, even with some safeguards, it would be very likely to result in serious harm, or even death, for a non-trivial number of people.
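
To make the scale argument concrete, here’s a purely hypothetical sketch. The clinic size and consultation length are numbers I’ve made up for illustration; only the roughly 1-in-200 rate comes from the estimate above:

```python
# Hypothetical scale illustration. The clinic size and consultation length
# below are invented assumptions; only the roughly 1-in-200 harmful rate
# comes from the rough estimate above.
consultations_per_day = 500      # assumed number of recorded visits per day
sentences_per_consultation = 80  # assumed sentences per visit
harmful_rate = 1 / 200           # ~1 harmful fabrication per 200 sentences

sentences_per_day = consultations_per_day * sentences_per_consultation
print(sentences_per_day * harmful_rate)  # => 200.0 harmful fabrications per day
```

Even with modest assumed volumes the harmful fabrications pile up quickly, and with a service that deletes the original audio there’s nothing left to verify them against.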

But the news isn’t entirely dire for machine learning transcription in general. The researchers ran the study’s tests on competing audio transcription models, and those results were very different:

Notably, we found no evidence of hallucinations in competing speech recognition systems such as Google Speech-to-Text (tested in April 2023) or the latest Google Chirp model (tested in December 2023): we identified exactly 0 comparable hallucination concerns (as defined above) from Google’s products out of the 187 identified audio segments. We similarly identified exactly 0 comparable hallucination concerns among the same 187 audio segments from Amazon, Microsoft, AssemblyAI, and RevAI speech-to-text services (tested in January 2024). This could indicate that advancements in generative language models such as PaLM2 (underlying Google Bard) were not being used in a similar manner in competing speech-to-text systems. As such, we believe hallucinations to currently be an OpenAI-specific concern

The phrase “as such, we believe hallucinations to currently be an OpenAI-specific concern” deserves a call-out, as I think that might end up being a recurring pattern in the future. Whatever the “AI” industry does, OpenAI seems to do with less care, less safety, and more haphazard governance.
