Testing Automatic Speech Recognition Software By Dom Bourne
An introduction to the R&D Report
Welcome to Take 1’s first R&D blog. Experimenting with new technology and exploring how we can use it to improve the service we offer has always been a passion of mine, and it’s a key part of my role at Take 1. But while these investigations have been essential to our success over twenty years in the broadcast industry, we’ve never shared our findings – until now. This quarterly blog is designed to keep our clients, colleagues and partners updated about new product developments at Take 1, share research we’re doing into other technology and voice our opinion on industry talking points. We hope you’ll find the content interesting and useful.
This month’s topic: automatic speech recognition
You don’t need me to tell you that AI is a hot topic at the moment, and automatic speech recognition is an application of AI that has the potential to fundamentally change how we produce transcriptions for broadcast. Recent developments in the quality of ASR outputs have already seen a growth in adoption of this technology across the industry, particularly when it comes to generating subtitles for live and online video content.
While we embrace using (and have developed our own) technology to introduce productivity gains in transcription workflows, to date machine-generated transcriptions have failed to meet the high-quality standards our clients rely on Take 1 to deliver. But recent improvements in ASR products such as Amazon Transcribe, IBM Watson and Speechmatics (the latter recently boasting a twenty percent increase in accuracy) prompted us to ask whether we’ve finally reached the point where a machine output with a human post-edit might be a better solution than transcribing from scratch.
There’s no question that technology should be integrated into transcription workflows; the real question is whether a machine output, post-edited by a human, is the better solution.
The ASR question
This is not the first time we’ve tested AI technology. In recent years, Take 1 has experimented with various ASR engines to see if they can provide operational efficiencies in creating transcriptions and captioning, and we’re very familiar with the cloud workflows necessary to use these engines. Previous tests have indicated that, while machines show promise in deciphering clearly spoken dialogue from a single speaker, the resulting transcripts lacked timecode information and didn’t yet meet the quality standards required for broadcast.
For this test we worked with one of the leading producers and broadcasters of unscripted content to test whether current ASR technologies can play a part in their transcription workflows, by either reducing the cost of creating the transcripts required to support their edits, or by providing an alternative, lower cost transcription output for content that isn’t usually transcribed.
The ASR test
A variety of content was used for the test, including:
- One-to-one interviews
- Multi-voice interviews
- On-the-fly monologues
Each piece of content was put through three different processes, and we compared the time, effort and end product of each:
- Human only transcription
- Raw ASR output (no human post-edit)
- ASR output + human post-edit
All ASR outputs were created using the same best-of-breed automatic speech recognition software, chosen because it is the most popular in the industry and is recognised for providing the highest-quality output.
The raw ASR output compared to human transcription
Our tests showed that ASR transcription of a monologue was five times quicker than manual transcription of the same content, rising to eight times for a multi-voice interview and twelve times for a single-person interview. However, the quality of the ASR output varied dramatically. While highly accurate in places, the ASR transcriptions also included some very inaccurate words, phrases and sentences, particularly when more than one speaker featured in the clip. The lack of punctuation, timecodes and correct speaker labels meant that a lot of post-editing would be required to produce an accurate transcription that includes all the information broadcast clients need.
ASR + human post-edit compared to human transcription
Our tests did show that there are situations where it may be quicker to post-edit an ASR-generated transcription than to produce the same content manually. The single-person interview took 128 minutes for our transcribers to produce from scratch, but only 100 minutes to create and edit an ASR script for the same content. However, for both the multi-voice interview and monologue clips, the opposite was true, with the ASR + human post-edit workflow taking longer than an entirely manual process. It was evident that, where clips included multiple voices or background noise, it was quicker to produce a human transcription from scratch than to create and clean up an ASR output.
At first glance these results might indicate that it’s more efficient to create transcripts for certain content using ASR technology with a human post-edit workflow. However, it is important to recognise that post-editors demand a higher hourly rate than transcribers, as the task carries a higher cognitive load. So, while an ASR + human post-edit workflow could produce modest time savings for certain content, the higher hourly rate negates those savings, leaving the overall cost comparable or higher.
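The trade-off can be sketched with a simple break-even calculation. The hourly rates below are hypothetical, for illustration only; the two durations are the ones measured in our single-person interview test:

```python
# Illustrative break-even sketch: does an ASR + post-edit workflow save money?
# The hourly rates below are assumed figures, not Take 1's actual rates.

def job_cost(minutes_of_labour: float, hourly_rate: float) -> float:
    """Cost of a job given labour time in minutes and an hourly rate."""
    return minutes_of_labour / 60 * hourly_rate

manual_minutes = 128      # transcribing from scratch (measured)
post_edit_minutes = 100   # creating and editing the ASR output (measured)
transcriber_rate = 20.0   # GBP/hour (assumed)
post_editor_rate = 28.0   # GBP/hour (assumed: higher cognitive load)

manual_cost = job_cost(manual_minutes, transcriber_rate)
post_edit_cost = job_cost(post_edit_minutes, post_editor_rate)

print(f"Manual: £{manual_cost:.2f}, ASR + post-edit: £{post_edit_cost:.2f}")
# With these assumed rates, the 28-minute time saving is wiped out:
# manual ≈ £42.67 vs ASR + post-edit ≈ £46.67
```

With these assumed rates, a 22% time saving is more than cancelled by a 40% higher hourly rate, which is exactly the pattern we observed.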
Furthermore, because the ASR output is delivered as a large block of unformatted text, additional time is needed to manually convert it into ScriptSync-compatible or other layouts required by broadcasters. By comparison, transcribing from scratch into Take 1’s Liberty software means that the text is entered as XML data, which can be easily reformatted into multiple outputs at the touch of a button.
The future of automatic speech recognition for video transcription
While this test revealed that Take 1’s current process for producing transcriptions and as-broadcast scripts from scratch is still more efficient than using automatic speech recognition software with a human post-edit, it also indicated that the gap is closing.
In addition, while current broadcast and production workflows demand the creation of highly accurate and detailed transcriptions, there may be other opportunities to use transcripts that don’t require this same level of detail and, therefore, require less human editing. For example, if timecodes and speaker labels weren’t required, or if 100% accuracy were not necessary, then creating transcriptions using ASR with a lower degree of human polishing could be significantly more efficient.
The Take 1 R&D team will continue testing ASR and exploring how we can harness this technology to improve both our, and our clients’, workflows. We’ll keep you posted!