Open Source vs. Commercial Speech AI: The AnnoMI Dataset Challenge
Introduction
In many real-world applications involving conversational analytics, understanding who said what during a conversation is crucial. Automatic speech recognition (ASR) systems have improved dramatically in recent years, but many are benchmarked against clean, single-speaker recordings. In the real world, we deal with noisy recordings of multiple speakers who may have different accents. In these scenarios, reliable speaker diarization (attributing speech to the right person) remains a challenge.
In this post, we benchmark three diarization and transcription systems to see how well they capture the flow of real-world, multi-speaker recordings:
- AssemblyAI – A commercial ASR API with built-in diarization
- RevAI – Another commercial ASR API with diarization capabilities
- Whisper + Pyannote – An open-source stack combining Whisper (transcription) and Pyannote (voice activity detection and diarization)
Dataset and Setup
We benchmark this experiment against the AnnoMI dataset, which provides 134 video conversations between two English speakers, often in a therapy or medical setting. The dataset contains diverse acoustic conditions, ranging from noisy or distant microphone recordings to clean role plays, with British and Southern US English accents. Some videos contain music, while others contain long periods of silence between speech.
Ground truth annotations include transcript utterances and speaker assignment, enabling a fair assessment of each system’s ability to capture which speaker said what words.
We used a random sample of 33% of the dataset. Any videos containing a narrator (often not included in the ground truth annotation) were removed from the experiment. In our trials, this resulted in a sample of 31 examples. Audio was downloaded directly from the dataset's official YouTube video collection.
| Accent | Percentage | Number of Recordings |
|---|---|---|
| American | 71% | 22 |
| British | 16% | 5 |
| Southern US | 13% | 4 |
| Total | 100% | 31 |
Evaluation Approach
Our primary goal was to measure speaker attribution accuracy — how often each utterance was correctly assigned to the speaker who said it. For our use case, the overall flow of the conversation and who said what is more important than exact timings of speech.
To score the results, we used two alignment metrics:
- Overall transcript error: We measure the similarity between the system's full transcript and the ground truth transcript using the standard Word Error Rate (WER). WER is a normalized edit distance: the number of word-level edits needed to turn the system's transcript into the ground truth, divided by the number of words spoken:

  Word Error Rate (WER) = (Substitutions + Insertions + Deletions) / Number of Words Spoken

- "Who Said What" errors: To specifically evaluate speaker attribution accuracy, we created a new metric we call the Who Said What Error Rate (WSWER). For each speaker identified by the system and in the ground truth, we gather all the words they said in order. We then calculate the WER between the system's word list for a speaker and the ground truth's word list for the same speaker. Essentially, this is the per-speaker transcription error rate, showing how well the system captured who said what in the conversation.
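As an illustration, the WER definition above can be computed with a word-level edit distance. This is a minimal, dependency-free sketch (the function name is ours; in practice a library such as jiwer provides the same calculation):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed as a word-level Levenshtein edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[-1][-1] / len(ref)
```

For example, dropping one word from a six-word reference gives a WER of 1/6.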
Results
Overall Transcript Accuracy
When looking only at overall transcript accuracy, all three methods perform similarly. RevAI has the best median Word Error Rate across the sample by a slim margin, but seems more prone to outlier errors, resulting in a higher average score. Overall, median scores are much lower than average scores, indicating that some audio files have extremely poor transcription quality.
Speaker Detection Accuracy
In the ground truth data, all files contain only two speakers. However, all three methods over-detected the number of speakers at some point across our sample:
- AssemblyAI performed best, only marginally over-detecting speakers
- Whisper + Pyannote had middle-of-the-road performance in speaker detection
- RevAI incorrectly detected more than 2 speakers most frequently
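To make this concrete, over-detection can be checked with a few lines of code, assuming each system's diarization output has been normalized to (start, end, speaker) tuples (a simplified stand-in for the APIs' own response schemas):

```python
def speaker_detection_report(segments, expected_speakers=2):
    """Summarize distinct speaker labels in a diarization output and flag
    over-detection; `segments` is a list of (start, end, speaker) tuples."""
    totals = {}
    for start, end, speaker in segments:
        # Accumulate total speech time per label; spurious extra speakers
        # often show up with only a few seconds of attributed speech.
        totals[speaker] = totals.get(speaker, 0.0) + (end - start)
    return {
        "num_speakers": len(totals),
        "over_detected": len(totals) > expected_speakers,
        "seconds_per_speaker": totals,
    }
```

Running this over all 31 files per system yields the over-detection comparison above.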
Speaker-level Attribution Accuracy
For speaker attribution (WSWER), AssemblyAI demonstrated the strongest performance, followed by RevAI, with the open-source method showing the highest error rates. The WSWER scores are roughly half the values of the overall transcript distance, which is intuitive given we’re averaging the speaker-level transcript distances per audio file.
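A simplified sketch of how WSWER can be computed (our illustration, not the exact code from the repository): because diarization labels are arbitrary, predicted speakers must first be matched to ground-truth speakers, here by trying every label mapping and keeping the lowest average per-speaker error:

```python
from itertools import permutations

def _wer(ref, hyp):
    # Word-level Levenshtein distance (single-row DP), normalized by reference length
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / max(len(ref), 1)

def wswer(truth: dict, predicted: dict) -> float:
    """Who Said What Error Rate: average per-speaker WER under the
    predicted-to-truth label mapping that minimizes the error.
    `truth` and `predicted` map speaker labels to ordered word lists.
    (Assumes the system found at least as many speakers as the ground truth.)"""
    true_labels = list(truth)
    best = float("inf")
    for perm in permutations(predicted, len(true_labels)):
        score = sum(_wer(truth[t], predicted[p])
                    for t, p in zip(true_labels, perm)) / len(true_labels)
        best = min(best, score)
    return best
```

With two ground-truth speakers this tries both label assignments, so a system that transcribes perfectly but swaps the labels still scores 0.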
Robustness across English accents
Speaker accent plays a big role in both speaker detection accuracy and attribution quality for all methods. This highlights probable bias in the training datasets toward certain regional accents, affecting model performance when processing speakers with less-represented accents.
Surprisingly, most systems did very well on Southern US accents (lower error rates are better). All three systems struggled more with British accents than with any other accent.
AssemblyAI showed the most consistent performance across accent types. However, for British accents, the open-source solution performs best. It's possible that the open-source model's training data has more British representation than the paid services' training sets.
Impact of Audio Length on WER
Figure: (Left) raw transcript WER by audio duration. (Right) the same WER values after applying a Gaussian filter (sigma = 2).

There is a directional relationship between audio file duration and error rates, and all three methods show similar performance patterns:
- Files under 10 minutes generally have lower (better) error rates
- A sweet spot for performance appears around 600 seconds (10 minutes)
- Error rates increase significantly for longer recordings
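The smoothed curves come from a Gaussian filter with sigma = 2 (e.g. scipy.ndimage.gaussian_filter1d); a dependency-free sketch of the same idea, with edge-truncated normalization rather than scipy's default reflection:

```python
import math

def gaussian_smooth(values, sigma=2.0):
    """Smooth a sequence with a discrete Gaussian kernel, renormalizing
    at the edges where the window is truncated."""
    radius = int(3 * sigma)  # a 3-sigma window captures ~99.7% of the kernel mass
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma))
              for k in range(-radius, radius + 1)]
    smoothed = []
    for i in range(len(values)):
        acc, norm = 0.0, 0.0
        for k in range(-radius, radius + 1):
            j = i + k
            if 0 <= j < len(values):
                w = kernel[k + radius]
                acc += w * values[j]
                norm += w
        smoothed.append(acc / norm)
    return smoothed
```

Applied to the per-file WER values (sorted by duration), this damps single-file outliers so the duration trend is easier to see.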
Impact of Audio Length on Who Said What
Here we see some differences across the three systems. While we observe the same directional trend that longer audio files experience higher error rates, the open-source method appears less prone to errors on longer audio. On the other hand, AssemblyAI shows the worst performance with longer audio files.
Additionally, there appears to be a sweet spot for good performance across all three methods for audio files which are ~600 seconds (10 minutes). This suggests these systems may be optimized to handle approximately 10-minute samples, possibly reflecting training data characteristics.
The number of audio files in the sample is small and these results may not be statistically significant, but they are directional: for both transcription (WER) and speaker attribution (WSWER), longer audio files consistently show worse performance.
Key Findings
Several important trends emerged from our analysis:
- AssemblyAI demonstrates reliable speaker diarization, maintaining consistent performance across English accents and recording conditions. It does not always have the lowest raw WER, but it excels at correctly attributing speech to speakers.
- RevAI showed the strongest overall transcription accuracy for shorter recordings, but struggled with longer conversations and accurate speaker attribution. It also over-detected the number of speakers most frequently.
- Whisper + Pyannote demonstrates decent performance, outperforming the paid services on certain accent types and on longer audio files. Overall, its transcription quality is nearly on par with the paid services. However, it is a less robust solution, and this affects its overall performance.
- Audio length significantly impacts all systems, with performance degrading as recordings exceed 10 minutes. This suggests potential optimization opportunities for longer conversations.
System Performance Summary
| System | Overall WER (%) | Speaker Attribution (WSWER %) | Speaker Detection Accuracy | Accent Robustness | Accuracy with Longer Audio |
|---|---|---|---|---|---|
| AssemblyAI | 19.8% | 24.5% | High | Low | Low |
| RevAI | 19.1% | 27.5% | Low | Medium | Medium |
| Whisper + Pyannote | 21.6% | 32.6% | Medium | High | High |
While AssemblyAI and RevAI do not often publish their WERs, Whisper reports roughly 9% WER on datasets such as Common Voice. The results we share here are significantly higher, which no doubt reflects the complexity of real-world data (here, therapy conversations across diverse acoustic settings and accent types).
Conclusions
For teams prioritizing speaker attribution in two-speaker conversations with minimal engineering effort, AssemblyAI shows the strongest out-of-the-box performance. RevAI offers superior transcription for single-speaker and shorter recordings. The open-source Whisper + Pyannote combination provides a strong alternative at zero recurring cost for teams that have the engineering resources and a preference for customization and fine-tuning, or that plan to explore various accent types.
Future work could explore:
- Improving open-source solutions specifically for multi-speaker conversations
- Testing robustness on noisy recordings with various accents
- Exploring techniques to maintain performance for longer recordings
- Creating more diverse training datasets to better handle regional accent variations
The training data used by these models clearly impacts performance across various English accent types, highlighting the importance of representative and diverse training sets for building truly accessible speech recognition systems.
Methods
For those interested in replicating or building upon this work, our evaluation involved:
- Calculating Word Error Rate (WER) using standard fuzzy text matching algorithms
- Developing a custom Who Said What Error Rate (WSWER) metric by comparing speaker-specific transcript collections
- Analyzing performance across different accents, recording conditions, and audio lengths
- Consistent processing of all 31 test recordings through each system using identical parameters
This provides a realistic assessment of how these systems perform in production environments where both accurate transcription and speaker identification are critical.
The code used for this analysis is available here: https://github.com/mpathic/compare-diarization