Should text-to-speech be used to create audio descriptions for TV?
While most of us are accustomed to seeing subtitled or signed video content (and may even rely on subtitles to watch videos when we’re commuting), we’re unlikely to come across audio-described content in our everyday lives. But for the 217 million people living with moderate to severe visual impairment and the 36 million blind people around the world, audio descriptions make video content accessible by narrating what is happening on the screen during natural pauses in the dialogue.
The problem with audio descriptions for TV.
Until recently, audio-described video content hasn’t been widely accessible on TV. The BBC launched its audio description (AD) service in 2000, but only 10% of television content in the UK is required to provide AD, compared to around 80% for subtitling. Ofcom has only just made recommendations for similar quotas for video-on-demand services. In the US, the FCC only started enforcing video description quotas for broadcasters in 2012, and Netflix first started producing AD for its original content in June 2015.
Some content distributors, like the BBC and Netflix, now provide more accessible content than they’re obliged to, having recognised that this content provides an opportunity to grow their audience share. But other broadcasters and networks have been slower to embrace audio-described content in particular, and cost is probably the most significant reason for this.
Writing descriptions that convey the necessary information succinctly, in a style that fits the programme, requires a specialised skillset that comes at a cost. Add to this the fees for recording studios, voice-over talent and post-production, and you start to understand why there isn’t more AD video content available.
What is text-to-speech and how can it help?
Text-to-speech converts written text into a synthesised spoken voice, typically by stitching together frequently used sounds to make words. It’s nothing new – one of the first commercial applications of TTS, the Kurzweil reading machine, was developed in the late 1970s – but the technology has only recently become accessible for wide use thanks to developments in artificial intelligence. Now, the reduced cost and improved quality of TTS outputs have prompted the launch of a number of new AI voice synthesisers, like Google’s Cloud Text-to-Speech, that have the potential to significantly reduce the costs associated with producing audio descriptions.
But is text-to-speech really the answer to all our AD problems?
The case for synthesised speech.
With no need for expensive recording studios and no talent fees, using a synthesised voice-over will undoubtedly be more cost-effective than recording with a real artist. And the savings don’t end with the actual recordings. Using text-to-speech means there’s no management overhead to co-ordinate the availability of voice-over artists and recording studios, the service is available 24/7, 365 days a year, and there are no contracts or renewal fees to track and budget for. Changes or additions don’t necessarily impact your budget, and synthesised voices can be sped up, pitch-shifted and saved into a number of different formats at the touch of a button.
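To illustrate that flexibility, most TTS engines accept Speech Synthesis Markup Language (SSML), which lets a describer adjust rate, pitch and pauses in the text itself rather than in a recording session. The snippet below is a generic sketch using standard W3C SSML elements (the description sentence is an invented example, and exact attribute support varies between engines):

```
<speak>
  <!-- Speed delivery up slightly and lower the pitch so the
       description fits a short pause in the dialogue -->
  <prosody rate="115%" pitch="-2st">
    She slips the letter into her coat pocket and turns away.
  </prosody>
  <!-- A brief pause before the next description begins -->
  <break time="300ms"/>
</speak>
```

A change like lengthening that pause or slowing the delivery is a one-line edit and a re-render, rather than a new studio booking.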
All of this should mean that broadcasters and networks will be able to produce more audio described content within the same budget, making more content accessible to visually impaired and blind audiences. How can that be a bad thing?
The case against text-to-speech for TV.
While text-to-speech technology might offer cost benefits, some of the perceived savings are a false economy. It takes longer to prepare scripts for synthesised voice recordings because writers need to spell words phonetically, and, because automated systems can’t dynamically adjust their delivery speed to fit the available time, descriptions often have to be clipped to length.
There are also a number of practical and creative drawbacks to this approach to access services. Perhaps the most significant obstacle is the limited range of languages and voices available in synthesised speech – for example, Google’s Cloud Text-to-Speech currently offers 100 different voices across 20 different languages, which may sound like a lot until you consider that there are somewhere between 6,500 and 7,100 living languages in the world. By contrast, using a global base of human voice-over artists means that audio descriptions can not only be produced in the appropriate language, but that age, accent and vocal qualities can be matched to the programme content to produce a coherent experience.
Synthetic voices also can’t compete with human artists when it comes to appropriate delivery and dramatic capability. If every element of video production – from wardrobe to set dressing, make-up, lighting and camera angles – is so carefully considered for viewing audiences, is it fair to offer visually impaired audiences a sub-standard experience by compromising on the quality of audio descriptions?
This debate is, perhaps, more complicated than it initially seems.
Let us know what you think and get in touch with Take 1 to find out how we can help you solve your audio description challenges.