I envision a future where everything will be captioned, so the more than 300 million people who are deaf or hard of hearing like me will be able to enjoy videos like everyone else. When I was growing up in Costa Rica, there were no closed captions in my first language, and only English movies had Spanish subtitles. I felt I was missing out because I often had to guess at what was happening on the screen or make up my own version of the story in my head. That was where the dream of a system that could automatically generate high-quality captions for any video was born.
Today I am lucky to be making my dream a reality as part of a team at YouTube exploring innovative ways to make captions more available for everyone. Over the years we have made great strides both in the number of videos with captions and in the accuracy of those captions.
Google first launched video captions back in 2006. Three years later these efforts were taken to a whole new level with automated captions on YouTube. This was a big leap forward to help us keep up with YouTube’s growing scale. Fast forward to today, and the number of videos with automatic captions now exceeds a staggering 1 billion. Moreover, people watch videos with automatic captions more than 15 million times per day.
One of the ways that we were able to scale the availability of captions was by combining Google’s automatic speech recognition (ASR) technology with the YouTube caption system to offer automatic captions for videos. There were limitations with the technology that underscored the need to improve the captions themselves. Results were sometimes less than perfect, prompting some creators to have a little fun at our expense!
A major goal for the team has been improving the accuracy of automatic captions, something that is not easy to do for a platform of YouTube’s size and diversity of content. Key to the success of this endeavor was improving our speech recognition and machine learning algorithms and expanding our training data. Altogether, those technological efforts have resulted in a 50 percent leap in accuracy for automatic captions in English, which is getting us closer and closer to human transcription error rates.
Automatic captions example from our previous model
Automatic captions example from our current model
Continuing to improve the accuracy of captions remains an important goal going forward, as does the need to keep growing beyond 1 billion automatic captions. We also want to extend that work to all ten of our supported languages. But we can’t do it alone. We count on the amazing YouTube community of creators and viewers everywhere. Ideally, every video would have an automatic caption track generated by our system and then reviewed and edited by the creator. With the improvements we’ve made to automatic speech recognition, this is now easier than ever.
I know from firsthand experience that if you build with accessibility as a guiding force, you make technology work for everyone.
Liat Kaver, product manager, recently watched “27 Things You Might Not Know about Star Wars – mental_floss List Show Ep. 450” with the automatic caption track.