Adding Benchmaxxer Repellant to the Open ASR Leaderboard
Published May 6, 2026
By Eric Bezzam (bezzam), Steven Zheng (Steveeeeeeen), Eustache Le Bihan (eustlb), Sergio Bruccoleri (SBruccoleriAppen, Appen), Jeanine Sinanan-Singh (jmss-appen, Appen), Casey Ford (c-e-ford-appen, Appen), Guanbo Wang (wgb14, DataoceanAI), Yukai Huang (YukaiHuang, DataoceanAI), Ke Li (like2026, DataoceanAI), Yufeng Hao (logicbean, DataoceanAI), Liao Xiaoling (ally-lxl, DataoceanAI)
"When a measure becomes a target, it ceases to be a good measure." (Goodhart’s Law)
TLDR: Appen Inc. and DataoceanAI have provided high-quality English ASR datasets covering scripted and conversational speech across multiple accents. To mitigate the risks of benchmaxxing and test-set contamination, we are keeping these datasets private, preserving a high-quality measure of performance across multiple tasks.
We’re not updating the average WER at this time: by default, the leaderboard’s Average WER remains computed on public datasets only. You can optionally include the private datasets using the toggle to see their impact 👀
Since its launch in September 2023, the Open ASR Leaderboard has been visited over 710K times. We’re blown away by the community’s interest and motivation to keep pushing speech recognition 🗣️
Two words sum up the objectives (but also challenges) in maintaining a benchmark like the Open ASR Leaderboard:
- Standardization: models can have different conventions for their usage and outputs, e.g. with/without punctuation and casing. Datasets face the same challenges and can be structured differently. To this end, all test sets have been gathered into a single dataset on the Hub for easy access and previewing. Moreover, to standardize model outputs and dataset transcripts, we use a normalizer that (among other things) removes punctuation and casing, and maps to American spelling. It is based on the normalizer of Whisper.
- Openness: the UI code and evaluation scripts are open-sourced. This has helped not only to incorporate new models, but also to improve the quality of the evaluation procedure through community feedback and contributions.
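To give a rough feel for what such a normalizer does, here is a minimal sketch in Python. This is a simplified stand-in, not the leaderboard's actual normalizer (which is based on Whisper's), and the British-to-American spelling map below is an illustrative subset:

```python
import re

# Illustrative subset of a British-to-American spelling map; the real
# Whisper-based normalizer covers many more words and edge cases.
BRITISH_TO_AMERICAN = {"colour": "color", "analyse": "analyze"}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace, Americanize spelling."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)    # drop punctuation (keep apostrophes)
    text = re.sub(r"\s+", " ", text).strip() # collapse runs of whitespace
    return " ".join(BRITISH_TO_AMERICAN.get(w, w) for w in text.split())

print(normalize("Colour me surprised -- it works!"))  # -> color me surprised it works
```

Applying the same normalization to both the model output and the reference transcript ensures that WER reflects recognition errors rather than formatting differences.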
Standardization and openness are essential for meaningful benchmarking, but they also make benchmarks more susceptible to benchmark-specific optimization ("benchmaxxing"), where models improve leaderboard performance without corresponding gains in real-world robustness. As models and use cases evolve, the Open ASR Leaderboard will continue incorporating high-quality datasets and new evaluation settings to better reflect real-world performance and improve robustness against benchmark-specific optimization.
As discussed in our report, there is no single "catch-all" ASR model: some perform better on American English, others on diverse accents and multilingual settings, while others are optimized for speed or conversational audio. Different applications also prioritize different capabilities, so a model that performs less well on one dimension is not necessarily a worse model overall. The goal of the Open ASR Leaderboard is to capture these nuances and provide a more holistic view of ASR performance.
New high-quality, private datasets
To this end, we have worked with Appen Inc. and DataoceanAI to curate high-quality datasets for ASR benchmarking. Below is some information on the various splits.
| Dataset | Accent | Duration [h] | Male / Female (%) | Style | Transcription |
|---|---|---|---|---|---|
| Appen Scripted AU | Australian | 1.42 | 49 / 51 | Read | Punctuated, cased |
| Appen Scripted CA | Canadian | 1.53 | 52 / 48 | Read | Punctuated, cased |
| Appen Scripted IN | Indian | 1.02 | 49 / 51 | Read | Punctuated, cased |
| Appen Scripted US | American | 1.45 | 49 / 51 | Read | Punctuated, cased |
| Appen Conversational IN | Indian | 1.37 | 51 / 49 | Conversational, spontaneous | Punctuated, disfluencies |
| Appen Conversational US003 | American | 1.64 | 49 / 51 | Conversational, spontaneous | Punctuated, cased, disfluencies |
| Appen Conversational US004 | American | 1.65 | 49 / 51 | Conversational, spontaneous | Punctuated, disfluencies |
| DataoceanAI Scripted US | American | 2.43 | 54 / 46 | Read | Punctuated, cased (proper nouns), disfluencies |
| DataoceanAI Scripted GB | British | 2.43 | 47 / 53 | Read | Punctuated, disfluencies |
| DataoceanAI Conversational US | American | 8.82 | NA | Conversational, spontaneous | Punctuated, disfluencies |
| DataoceanAI Conversational GB | British | 5.96 | NA | Conversational, spontaneous | Punctuated, disfluencies |
Below are audio samples showing the variety of content (scripted, conversational, acronyms, disfluencies, proper nouns).
While private datasets may sound contrary to the spirit of openness, we believe that incorporating them will increase the trustworthiness of the Open ASR Leaderboard: they are far harder to exploit for benchmaxxing, whether by model developers who train on the public test sets directly or by those who source training data closely resembling a particular dataset to boost their macroaverage score.
With these datasets, we can also provide targeted metrics to highlight gaps and biases between controlled and often saturated settings (scripted, American accent) and more nuanced conditions (conversational and non-American accents). Below is a screenshot of the new "Private data" tab.
Below is how each column is computed.
- "Average WER" computes a macroaverage of the data provider averages, so that providers are weighted equally.
- "Avg Scripted" performs a macroaverage of all scripted datasets.
- "Avg Conversational" performs a macroaverage of all conversational datasets.
- "Avg US" performs a macroaverage of all datasets with American accents.
- "Avg non-US" performs a macroaverage of all datasets with non-American accents.
We intentionally do not publish a score for each individual split, to prevent model developers from boosting their score by targeting a specific data provider or accent.
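To make the provider weighting concrete, here is a minimal sketch of the macroaveraging scheme described above. The dataset names and WER values are invented for illustration, not actual leaderboard numbers:

```python
# Macroaverage: average WER within each provider first, then average the
# provider means, so each provider contributes equally regardless of how
# many datasets it supplies. Numbers below are made up.
from statistics import mean

wers = {
    "appen":       {"scripted_us": 4.1, "conversational_us": 9.8},
    "dataoceanai": {"scripted_us": 5.0, "conversational_us": 11.2},
}

provider_avgs = {p: mean(d.values()) for p, d in wers.items()}
average_wer = mean(provider_avgs.values())  # providers weighted equally
print(average_wer)
```

Note that this differs from a flat average over all datasets, which would give more weight to providers contributing more splits.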
How can I evaluate my model on this data?
Get your model on the Open ASR Leaderboard, and we'll run the evaluation! As before, the process for adding a model to the leaderboard takes place on the Open ASR Leaderboard GitHub:
1. Open a pull request, and a model checklist will appear. As before, you should report your results on the public datasets.
2. We will verify the results on the public sets and compute the metrics on the private ones.
3. Confirm the results we’ve obtained.
While you wait for your model to be added to the Open ASR Leaderboard, you can self-report your metrics on the public sets by adding a YAML file like this to your model card. Your model will then appear on an (unverified) leaderboard that appears on the dataset page (see screenshot below). More on this approach to decentralized evaluation can be read here.
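Self-reported results on the Hub use the model-index YAML schema in the model card metadata. A sketch might look like the following, where the model name, dataset fields, and WER value are placeholders:

```yaml
model-index:
- name: my-asr-model                 # placeholder model name
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)      # one of the public leaderboard test sets
      type: librispeech_asr
      config: clean
      split: test
    metrics:
    - type: wer
      value: 5.4                     # placeholder self-reported WER
```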
Do models trained on the data providers have an advantage?
They could. We’ve asked Appen and DataoceanAI not to provide this data to their clients. But even if they don’t share this exact data, data from a similar distribution could still help a model on the corresponding evaluation set (similar to benchmaxxing by optimizing for a challenging task from the public sets). Having multiple data providers balances out the advantage a model may gain from training on data from any one provider. And we are open to adding more data providers and eval sets to the "Private data" tab!
Moreover, to ensure that the private sets do not affect the default model ranking, we’ve defaulted the Average WER to exclude the private sets from its macroaverage.
In the screenshot below, you can see that "Private data" is toggled off. This means that the macroaverage across datasets does not include it.
Simply toggle on "Private data" splits to include them in the macroaverage.
The "Rank Δ" column shows how the ordering changes relative to the default macroaverage configuration. Including or excluding public datasets also changes the macroaverage, allowing users to tailor the evaluation to the use cases and data distributions most relevant to their application.
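The "Rank Δ" computation can be sketched as follows; the model names and WER values are invented for illustration:

```python
# Rank Δ: a model's rank under the default (public-only) macroaverage minus
# its rank once the private sets are toggled in. Positive means the model
# moves up when private data is included. WERs below are made up.
def ranks(scores: dict[str, float]) -> dict[str, int]:
    """Rank models by WER, lower is better (1 = best)."""
    ordered = sorted(scores, key=scores.get)
    return {model: i + 1 for i, model in enumerate(ordered)}

default      = {"model_a": 6.2, "model_b": 6.5, "model_c": 7.0}  # public only
with_private = {"model_a": 7.4, "model_b": 6.9, "model_c": 7.1}  # toggled on

rank_default = ranks(default)
rank_private = ranks(with_private)
delta = {m: rank_default[m] - rank_private[m] for m in default}
print(delta)  # -> {'model_a': -2, 'model_b': 1, 'model_c': 1}
```

Here model_a tops the public-only ranking but drops two places once the private sets count, which is exactly the kind of gap the column is meant to surface.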
What's next?
We’re excited to hear the community’s feedback on how the new track and dataset toggling features help users identify the model(s) that best fit their application(s). We’re also looking into evaluations that better reflect real-world noisy conditions, and you can expect some news on that 😉
While preparing the private evaluation sets, we took extra care to ensure consistent audio and transcript quality across datasets, including developing tooling to identify challenging cases such as low signal-to-noise conditions or transcript mismatches, since these factors can meaningfully affect WER. More on that in a future post!
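As a hedged illustration of that kind of check (our actual tooling is not public), a naive energy-based SNR estimate can contrast the loudest frames of a clip, which mostly contain speech, with the quietest frames, which mostly contain background noise:

```python
# Naive energy-based SNR flagging: frame the signal, take the ~90th-percentile
# frame energy as "speech" and the ~10th percentile as "noise". Purely
# illustrative; real tooling would use a VAD and calibrated thresholds.
import math
import random

def naive_snr_db(samples: list[float], frame: int = 400) -> float:
    energies = sorted(
        sum(x * x for x in samples[i : i + frame]) / frame
        for i in range(0, len(samples) - frame + 1, frame)
    )
    noise = max(energies[len(energies) // 10], 1e-12)   # ~10th percentile
    signal = energies[(9 * len(energies)) // 10]        # ~90th percentile
    return 10 * math.log10(signal / noise)

# Synthetic 1 s clip at 16 kHz: 0.5 s of faint noise, then 0.5 s of a loud tone.
random.seed(0)
clip = [0.01 * random.gauss(0, 1) for _ in range(8000)]
clip += [math.sin(2 * math.pi * 220 * t / 16000) for t in range(8000)]
print(naive_snr_db(clip) > 20)  # prints True: this clip is clean
```

Clips whose estimate falls below some threshold would be flagged for manual review rather than dropped automatically.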
Community
weiwchu
3 days ago
Just a quick observation on how we evaluate our models: The two data vendors mentioned earlier are offering datasets directly to ASR service providers for training. With that in mind, it feels like we should be extra prudent about mixing private data into our evaluations.
At the same time, we're missing out on a lot of high-quality, open-source speech datasets simply because their creators don't have sales teams advocating for them. It would be a huge step forward to open a channel where the research community can recommend these datasets to the leaderboard directly.
P.S. I found myself relating a lot to King George in the movie Hoppers recently. That beaver king just believes the best about everyone. I want to bring that same optimism here—hoping that everyone continues to follow the leaderboard rules, avoids training on open-source test sets, and actively cleans their training data of any contamination.
Wei Chu
Researcher
wei@olewave.com
bezzam
Article author
2 days ago
@weiwchu thank you for your comments 🙂
> The two data vendors mentioned earlier are offering datasets directly to ASR service providers for training. With that in mind, it feels like we should be extra prudent about mixing private data into our evaluations.
Completely agree, that is why we don't include these new datasets in the default average WER computation. We hope that users have this nuanced view of the data sources. But also on the types of content, which is why we added splits on scripted/conversational and American/non-American accents.
> It would be a huge step forward to open a channel where the research community can recommend these datasets to the leaderboard directly.
This is possible on our GitHub repo! This checklist describes how a new model or dataset can be contributed.