WER — Word Error Rate


What WER Is

Word Error Rate (WER) is the standard industry metric for measuring how accurately a speech recognition system transcribes spoken words into text. It is expressed as a percentage — lower is better.

WER = (Substitutions + Deletions + Insertions) / Total Words in the Reference Transcript

Where:

  • Substitution — a word was transcribed as a different word (“aspirin” → “aspiran”)
  • Deletion — a word was missed entirely (“patient takes aspirin daily” → “patient takes daily”)
  • Insertion — a word was added that wasn’t spoken (“patient takes aspirin” → “patient takes V aspirin”)
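
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring code of any particular ASR vendor: it assumes whitespace tokenization and case-insensitive matching, and the function name wer_counts is hypothetical. It aligns the two transcripts with a standard word-level edit distance and reports how many substitutions, deletions, and insertions the cheapest alignment needs.

```python
def wer_counts(reference: str, hypothesis: str):
    """Return (substitutions, deletions, insertions, wer).

    Assumptions for this sketch: words are split on whitespace and
    compared case-insensitively; no punctuation or number normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # cost[i][j] = (total edits, subs, dels, ins) to turn ref[:i] into hyp[:j]
    cost = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)   # every reference word deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)   # every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]   # match, no new error
                continue
            e, s, d, n = cost[i - 1][j - 1]
            substitution = (e + 1, s + 1, d, n)
            e, s, d, n = cost[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = cost[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            # Tuples compare on total edits first, so this picks the cheapest path.
            cost[i][j] = min(substitution, deletion, insertion)

    _, subs, dels, ins = cost[len(ref)][len(hyp)]
    return subs, dels, ins, (subs + dels + ins) / len(ref)


# The deletion example from the list above: one of four reference words is missed.
print(wer_counts("patient takes aspirin daily", "patient takes daily"))
# (0, 1, 0, 0.25)
```

Because the denominator is the reference length while insertions add to the numerator, WER can exceed 100% on very noisy output.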

Why WER Is Important for Medical AI

In clinical documentation, a WER of 10% may sound acceptable. But consider what 10% means in practice:

  • A 10-minute conversation at ~150 words per minute = ~1,500 words
  • 10% WER ≈ 150 word errors (substitutions, deletions, and insertions combined)
  • One wrong drug name, one wrong dosage, one wrong body part — any of these could be clinically significant

For medical ASR, the target WER is generally below 5–10% on clinical conversations; in practice, best-in-class systems land at 8–13% on in-domain medical benchmarks.

Abridge reports a WER of 12.7% on its internal medical benchmark, or roughly one error for every eight words spoken. This sounds high in isolation, but the relevant comparison is against other medical ASR systems, against which Abridge reports a 24% relative reduction in WER.
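
A relative reduction is easy to confuse with an absolute one, so the arithmetic is worth spelling out. The sketch below uses hypothetical helper names and rests on a loud assumption: that the 12.7% figure is the WER after the reduction and that the 24% is measured against a single comparison system on the same benchmark. The source does not state the baseline WER, so the implied value is only an illustration of how the two numbers relate.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline's error that was removed."""
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer: float, rel_reduction: float) -> float:
    """Baseline WER that would yield new_wer after the given relative reduction."""
    return new_wer / (1.0 - rel_reduction)


# Assumption: 12.7% is the post-reduction WER and 24% is relative to one baseline.
baseline = implied_baseline(0.127, 0.24)
print(f"implied baseline WER: {baseline:.1%}")          # about 16.7%
print(f"absolute reduction:   {baseline - 0.127:.1%}")  # about 4 percentage points
```

Under that assumption, a 24% relative reduction corresponds to roughly four percentage points of WER, not twenty-four.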


WER vs. Other Metrics

Metric    What it measures             Best for
WER       Word-level accuracy          Overall transcription quality
CER       Character-level accuracy     Medical terms, drug names, dosages
MTR       Medical Term Recall          Whether critical medical terms are captured correctly
Accuracy  Correct words / total words  General performance (roughly 1 − WER)

Abridge reports a Medical Term Recall (MTR) of 97% alongside its WER: 97% of medical terms are transcribed correctly, even though the overall WER is 12.7%.

This distinction matters: WER counts every error equally, while MTR focuses on whether the clinically important terms (medications, diagnoses, procedures) made it through.
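
To make the contrast concrete, here is a minimal sketch of a term-recall calculation. The exact definition Abridge uses for MTR is not given in this document, so the one below is an assumption: the share of medical-term occurrences in the reference that also appear in the hypothesis. The function name, the tiny term list, and the restriction to single-word terms are all hypothetical simplifications.

```python
from collections import Counter


def medical_term_recall(reference: str, hypothesis: str, medical_terms) -> float:
    """Share of medical-term occurrences in the reference found in the hypothesis.

    Assumed definition for illustration only; handles single-word terms
    and ignores word order.
    """
    terms = {t.lower() for t in medical_terms}
    ref_counts = Counter(w for w in reference.lower().split() if w in terms)
    hyp_counts = Counter(w for w in hypothesis.lower().split() if w in terms)

    total = sum(ref_counts.values())
    if total == 0:
        raise ValueError("reference contains no medical terms")

    # A term occurrence counts as recalled only as often as it shows up in the hypothesis.
    recalled = sum(min(n, hyp_counts[t]) for t, n in ref_counts.items())
    return recalled / total


terms = ["ibuprofen", "metformin"]  # hypothetical term list
reference = "the patient takes ibuprofen and metformin daily"

# A function-word error leaves term recall untouched.
print(medical_term_recall(reference, "a patient takes ibuprofen and metformin daily", terms))
# 1.0

# A drug-name error halves it, although both hypotheses contain one word error.
print(medical_term_recall(reference, "the patient takes iron and metformin daily", terms))
# 0.5
```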


WER in the Abridge Benchmarks

From the Abridge AI Evaluation Whitepaper and Abridge Confabulation Elimination Whitepaper:

  • Abridge internal WER on clinical conversations: 12.7%
  • 24% relative reduction vs. other medical ASR models
  • 83% relative reduction in error on new medications specifically
  • 15% relative improvement on accented English

The 83% reduction on new medications is striking: it suggests that Abridge’s medical fine-tuning specifically improved accuracy on medication names, where transcription errors are often the most clinically consequential.


Limitations of WER

WER treats all words as equally important. A substitution of “ibuprofen” → “iron” is clinically catastrophic; a substitution of “a” → “the” is irrelevant. WER cannot distinguish between these cases.
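
The point can be checked directly. The sketch below is an illustrative implementation with a hypothetical function name, whitespace tokenization, and made-up sentences. It scores two hypotheses against the same reference: one swaps an article, the other swaps the drug name, and WER gives both the same score.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edits needed to turn the reference words seen so far into hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "take a tablet of ibuprofen"
harmless = "take the tablet of ibuprofen"   # "a" -> "the"
harmful = "take a tablet of iron"           # "ibuprofen" -> "iron"

print(word_error_rate(reference, harmless))  # 0.2
print(word_error_rate(reference, harmful))   # 0.2
```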

For medical AI evaluation, WER should be paired with: