How Reliable Is Automatic Speech Recognition for Orthodontic Records?
Electronic health records are now routine for clinical record-keeping, yet many of us still use a keyboard to enter patient data. With the development of automatic speech recognition (ASR), several packages have become available for use in healthcare. Nevertheless, converting clinical speech into accurate text remains challenging. This new paper examined the accuracy of automatic speech recognition in orthodontic clinical records. It is relevant to all dental healthcare providers.
This is not a post about a clinical subject, but an established orthodontic research team conducted this study. I therefore thought it was relevant to my blog, and it was certainly a change from looking at “airway” papers.
A team based in the beautiful South of England and Zurich conducted this study. The Journal of Dental Research published the paper. This paper is open access.

Transcription Accuracy of Automatic Speech Recognition for Orthodontic Clinical Records
R. O'Kane et al.
Journal of Dental Research, DOI: 10.1177/00220345251382452
I have a conflict of interest as I know this research team well. Martyn Cobourne and I come from a small village in the rural county of Worcestershire and went to the same schools and played in the same park when we were children.
What did they ask?
They did this study to answer this question.
“What is the transcriptional accuracy of ASR systems in dentistry using narrated orthodontic clinical records?”
What did they do?
They carried out a cross-sectional study with the following stages:
They identified 10 distinct automatic speech recognition systems for orthodontic clinical records. Four of these were commercially available clinical systems, such as Dragon Medical One.
The second category was standalone speech-to-text systems that provided direct access to widely available automatic speech recognition models, such as GPT-4o Transcribe and OpenAI's Whisper.
The third category was an experimental ASR system augmented by natural language processing and large language models (LLMs) that use generative error correction. This was called GPT-4o Transcribe Corrected.
They then dictated from prepared orthodontic clinical records, including diagnoses and treatments, to generate transcripts using various ASR systems.
Interestingly, they evaluated all systems in the presence or absence of background noise and across variations in narrator accents.
They then assessed each system for transcriptional, lexical, and semantic accuracy using validated word- and character-error metrics. The primary outcome was the Domain Word Error Rate (DWER), which assesses transcription accuracy with respect to clinical terminology.
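To make the metric concrete, here is a minimal sketch of a plain word error rate (WER) calculation, the family of metrics the study's DWER builds on (DWER restricts the count to clinical terminology, which is not reproduced here). The example phrases are hypothetical illustrations, not data from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Levenshtein distance over words, counting substitutions,
    # deletions, and insertions via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One mistranscribed word out of four gives a WER of 25%.
print(wer("fit an essix retainer", "fit an essex retainer"))  # 0.25
```

A domain-restricted variant like DWER would apply the same arithmetic but count errors only on a defined list of clinical terms, which is why it is more sensitive to exactly the “Essix”/“Essex” confusions the authors report.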
What did they find?
They produced a large amount of data. There were significant differences between the systems in terms of DWER.
They found that GPT-4o Transcribe Corrected outperformed all other systems, achieving a DWER of only 3.47%. GPT-4o Transcribe and Heidi Health were consistently ranked second and third best, with DWERs of 6.1% and 7.6%, respectively.
Interestingly, the commercially available systems did not perform as well. For example, Dragon Professional Anywhere ranked worst across all transcription error metrics, with a DWER of 48%, and Dragon Medical One ranked second-worst, with a DWER of 29%.
With the exception of GPT-4o Transcribe Corrected, the systems had considerable difficulty recognising domain-specific words. They also found that background noise increased the word error rate. However, this effect was system-dependent, with the two GPT-4o variants and Heidi Health showing the greatest resilience. They also found that speaker accent had only a minor influence.
The authors highlighted several orthodontic terms that were mistranscribed across the systems. They included these in a very useful table. I do not have space to include them all in this post. However, it was interesting to see that terms such as “Essix retainer” were consistently transcribed as “Essex”, the county, and that terms such as “mesially” were transcribed as “nasally”, “easily”, and “measly”, with a 75% mistranscription rate across all systems. (This is exactly what happened when I was dictating this post into Wispr Flow!)
Their final conclusion was:
“There was significant performance variability amongst the tested ASR systems. All were capable of introducing clinically significant mistranscriptions. We need to be cautious about using this technology at the moment, as it requires considerable checking.”
What did I think?
I’ve always messed about with computers and their technology since the early days of the personal computer. In the past, I tried many dictation systems, and they were all abject failures. Recently, I have been using Wispr Flow to dictate this blog, and it appears to be an excellent package. I was therefore very interested to see this study.
The authors did the study well and wrote a clear paper. The Journal of Dental Research published it. This is a difficult journal in which to get a paper accepted, and I have never managed to achieve this.
I found the findings interesting, and they point us toward which packages we might use.
This is clearly going to be a very fast-moving field. I’m not sure which package to use; however, it is important to note that the non-clinical packages were surprisingly effective compared with those developed for clinical use.
We also need to consider what an acceptable error rate is. This is rather difficult because the consequences of different errors are likely to vary. For example, misidentifying a tooth for extraction is likely to have greater consequences than a mistake in recording a molar relationship.
The authors of the paper also highlighted the risk of “hallucinations”. They explained these as “a type of output error where the model generates fluent, coherent text that is entirely unrelated to or ungrounded in the source audio input”. The fabricated transcriptions often appear convincing but do not match the actual spoken content. In this study, they found hallucinations that included invented discussions of tooth restorations or alternative treatments. It is crucial that we identify these, as they have the potential to cause confusion or harm.
This is an important paper for all clinicians. I look forward to other developments in this fast-moving field.

Emeritus Professor of Orthodontics, University of Manchester, UK.
I work with a couple of clinicians who use dictation programmes. I find they tend to produce very long and wordy posts, which can be difficult to read through to find the key information. I also notice that they do not always read through their posts before saving, as errors do appear (confirmed by this paper).
Clinical posts are getting longer as we ensure we’ve recorded everything we’ve discussed to guard against litigation, but I feel this can be counterproductive if key points are getting lost in a sea of words.