By Seamus McAteer – Jan 3, 2024
Automated dubbing employs several technologies to convert source audio or video into text, translate the resulting transcript, and generate speech from the translation. However, constructing an integrated system involves significantly more than just assembling these components. Challenges arise with background audio, non-verbal vocals, the inclusion of singing, and aligning the timing of the output with the source, among other issues.
Training so-called end-to-end models for dubbing remains experimental. Google research teams showcased Translatotron, a direct speech-to-speech translation model, in 2019, followed by Translatotron 2 in 2021. Since then, no commercial systems using this approach have been launched, and little new research has been published in the field.

Newer models for automated speech translation combine speech recognition with machine translation to produce translated text output. For instance, OpenAI's open-source speech recognition model, Whisper, supports automated translation to English from dozens of languages. While these models' performance will likely improve over time, the potential for compounded errors and the difficulty of editing and aligning the translated text limit their use as part of an enterprise dubbing platform.
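As an illustration of this class of model, the sketch below uses the open-source openai-whisper package to produce English text directly from foreign-language audio in a single pass; the file name and model size are placeholders.

```python
# Sketch: speech-to-translated-text in one step with open-source Whisper.
# Assumes `pip install openai-whisper` and a local file "episode.mp4" (placeholder).
import whisper

model = whisper.load_model("medium")  # larger checkpoints translate more accurately

# task="translate" returns English text regardless of the spoken source language
result = model.transcribe("episode.mp4", task="translate")

print(result["text"])            # full translated transcript
for seg in result["segments"]:   # per-segment timing, useful later for alignment
    print(f'{seg["start"]:.2f}-{seg["end"]:.2f}: {seg["text"]}')
```

Note that the translate task returns only the English text; obtaining an editable source-language transcript requires a separate transcription pass, which is one aspect of the editing limitation noted above.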
Automated dubbing instead employs cascading models that rely on separate elements for transcribing the source audio, translating the resulting text, and generating audio output from the translation. Building such a system in practice requires multiple AI models for the following tasks (code sketches of the cascade follow the list):
Source Audio Separation: This involves separating the vocal and background audio tracks. Background audio may include music, environmental sounds, or crowd noise. A dub needs to retain background audio for a natural sound.
Segmentation and Speaker Labeling: AI models segment the vocal track based on pauses, changes in speaker, and tone shifts. A separate model clusters segments based on characteristics of the speakers’ voices, assigning distinct labels to these clusters.
Speech Recognition: Automated speech recognition (ASR) generates written text from vocal audio. Models like Whisper can produce hallucinations and artifacts, which may be addressed using a number of heuristics during processing.
Translation: Neural machine translation (MT) is reasonably mature. The quality of translation is influenced by the punctuation and segment length of the source transcript. New models may combine neural MT for accuracy with large language model (LLM) technology for more fluent contextual translation.
Speaker Voice Generation: Text-to-speech models may use a limited number of pre-trained voices or zero-shot cloning, where voices are dynamically generated using less than 10 seconds of the source speaker’s or another speaker’s voice.
Text-to-Speech: The TTS model generates voice output from the translated text. Text normalization rules ensure proper pronunciation of acronyms, numbers, and other specific words. Code-switching enables accurate handling of proper nouns and technical terms that retain their original-language pronunciation.
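To make the cascade concrete, here is a minimal sketch of its first three stages using one plausible set of open-source components: Demucs for source separation, pyannote.audio for segmentation and speaker labeling, and Whisper for speech recognition. The file names, model identifiers, and access token are placeholders, and the midpoint speaker lookup is a deliberately crude heuristic.

```python
# Sketch of the front half of a dubbing cascade: separation -> diarization -> ASR.
# Assumes: pip install demucs pyannote.audio openai-whisper
# File names, model identifiers, and the Hugging Face token are placeholders.
import subprocess
import whisper
from pyannote.audio import Pipeline

SOURCE = "episode.wav"

# 1. Source audio separation: split vocals from background (music, ambience, crowd).
#    Demucs writes vocals.wav and no_vocals.wav under ./separated/htdemucs/episode/
subprocess.run(
    ["python", "-m", "demucs", "--two-stems=vocals", "-o", "separated", SOURCE],
    check=True,
)
vocals = "separated/htdemucs/episode/vocals.wav"

# 2. Segmentation and speaker labeling (diarization): who speaks when.
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="HF_TOKEN"  # placeholder token
)
diarization = diarizer(vocals)

# 3. Speech recognition on the isolated vocal track.
asr = whisper.load_model("medium")
transcript = asr.transcribe(vocals)["segments"]

# Attach a speaker label to each ASR segment by midpoint lookup (crude heuristic).
def speaker_at(t: float) -> str:
    for turn, _, label in diarization.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return label
    return "UNKNOWN"

for seg in transcript:
    mid = (seg["start"] + seg["end"]) / 2
    print(f'{speaker_at(mid)} [{seg["start"]:.1f}s]: {seg["text"].strip()}')
```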
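For the synthesis stages, the sketch below pairs a Helsinki-NLP MarianMT checkpoint for English-to-Spanish translation with Coqui's XTTS v2 for zero-shot voice cloning, conditioning output on a few seconds of the source speaker's audio. The checkpoints and file names are illustrative rather than a recommendation, and the text normalization and code-switching rules described above are omitted for brevity.

```python
# Sketch of the back half of the cascade: neural MT -> zero-shot voice-cloning TTS.
# Assumes: pip install transformers sentencepiece torch TTS
# Checkpoints and file names below are illustrative placeholders.
from transformers import MarianMTModel, MarianTokenizer
from TTS.api import TTS

# Neural machine translation (English -> Spanish).
mt_name = "Helsinki-NLP/opus-mt-en-es"
tokenizer = MarianTokenizer.from_pretrained(mt_name)
mt_model = MarianMTModel.from_pretrained(mt_name)

def translate(text: str) -> str:
    batch = tokenizer([text], return_tensors="pt", padding=True)
    generated = mt_model.generate(**batch)
    return tokenizer.decode(generated[0], skip_special_tokens=True)

# Zero-shot voice cloning: XTTS conditions on a short reference clip of the speaker.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

segment_text = "The quarterly numbers came in well above expectations."
translated = translate(segment_text)

tts.tts_to_file(
    text=translated,
    speaker_wav="speaker_01_reference.wav",  # ~6-10 s of the source speaker (placeholder)
    language="es",
    file_path="segment_0001_es.wav",
)
```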
These AI models represent just a portion of the technologies used in an automated dubbing system. A product must also accommodate human review and editing. For natural-sounding output, a system should automate the pacing of speech output and integrate other audio elements from the original video.
Speech recognition output and AI-derived translations require proofing and editing, so any service should enable manual review and revision. Collaborative editing with access-privilege controls, APIs, and methods for integration with other platforms used for transcription and translation are essential elements of any enterprise-class dubbing platform.
Aligning the timing of the dub with the source is critical for most video content. Timing issues are driven largely by verbosity differences across languages; Spanish translations, for example, tend to be about 25% longer than the corresponding English content. Merging or splitting segments, varying speaker speed, and using AI to derive alternative, more concise translations are some of the methods used to optimize timing alignment.
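One common tactic is to compare the duration of a synthesized segment with the time slot available in the source and apply a bounded speed-up, falling back to a more concise alternative translation when the required stretch would sound unnatural. The sketch below uses librosa for the time-stretch; the 15% ceiling is an assumed threshold, not an industry standard.

```python
# Sketch: fit a synthesized segment into its source time slot with a bounded time-stretch.
# Assumes: pip install librosa soundfile. The 15% stretch ceiling is an illustrative choice.
import librosa
import soundfile as sf

MAX_SPEEDUP = 1.15  # assumed ceiling before speech starts to sound unnatural

def fit_segment(dub_path: str, slot_seconds: float, out_path: str) -> bool:
    """Return True if the segment fits after a bounded speed-up, else False
    (signalling that a shorter alternative translation should be requested)."""
    audio, sr = librosa.load(dub_path, sr=None)
    dub_seconds = len(audio) / sr

    if dub_seconds <= slot_seconds:
        sf.write(out_path, audio, sr)          # already fits; keep original pacing
        return True

    rate = dub_seconds / slot_seconds          # >1.0 means we must speed up
    if rate > MAX_SPEEDUP:
        return False                           # too aggressive; re-translate instead

    stretched = librosa.effects.time_stretch(audio, rate=rate)
    sf.write(out_path, stretched, sr)
    return True

# Example: a 4.8 s Spanish segment must fit a 4.2 s slot (ratio ~1.14, acceptable).
if not fit_segment("segment_0001_es.wav", slot_seconds=4.2, out_path="segment_0001_fit.wav"):
    print("Request a more concise alternative translation for this segment.")
```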
Once generated, the dubbed vocal track is integrated with elements of the original video, including background audio and music, and any non-verbal vocals such as screaming or laughter. In some cases, portions of the original verbal vocal track are retained: singing or rapping, for example, is processed by speech recognition but marked for retention in the dub.
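Reassembly can then be a matter of overlaying the dubbed segments onto the preserved background stem and layering any retained original vocals back in at their source timestamps. A minimal sketch with pydub, assuming the placeholder file names from the earlier sketches:

```python
# Sketch: reassemble the final mix from background audio, dubbed segments,
# and retained original vocal passages (e.g., singing). pip install pydub (needs ffmpeg).
from pydub import AudioSegment

background = AudioSegment.from_file("separated/htdemucs/episode/no_vocals.wav")
mix = background

# (start_ms, file) pairs are placeholders for the dubbed segments produced upstream.
dubbed_segments = [(0, "segment_0001_fit.wav"), (5200, "segment_0002_fit.wav")]
for start_ms, path in dubbed_segments:
    mix = mix.overlay(AudioSegment.from_file(path), position=start_ms)

# Passages flagged for retention (e.g., a sung chorus) come from the original vocal stem.
original_vocals = AudioSegment.from_file("separated/htdemucs/episode/vocals.wav")
retained = [(61_000, 78_500)]  # placeholder start/end times in milliseconds
for start_ms, end_ms in retained:
    mix = mix.overlay(original_vocals[start_ms:end_ms], position=start_ms)

mix.export("episode_dubbed_es.wav", format="wav")
```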