Google AI Blog: VDTTS: Visually-Driven Text-To-Speech


Recent years have seen a tremendous increase in the creation and serving of video content to users around the world, in a variety of languages and over numerous platforms. The process of creating high-quality content can include several stages, from video capturing and captioning to video and audio editing. In some cases, dialogue is re-recorded in a studio (referred to as dialog replacement, post-sync, or dubbing) in order to achieve high quality and to replace original audio that might have been recorded in noisy conditions. However, the dialog replacement process can be difficult and tedious because the newly recorded audio needs to be well synced with the video, often requiring several edits to match the exact timing of mouth movements.

In “More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech”, we present a proof-of-concept visually-driven text-to-speech model, called VDTTS, that automates the dialog replacement process. Given a text and the original video frames of the speaker, VDTTS is trained to generate the corresponding speech. In contrast to standard visual speech recognition models, which focus on the mouth region, we detect and crop full faces using MediaPipe to avoid potentially excluding information pertinent to the speaker’s delivery. This gives the VDTTS model enough information to generate speech that matches the video while also recovering aspects of prosody, such as timing and emotion. Despite not being explicitly trained to generate speech that is synchronized to the input video, the learned model still does so.

Given a text and video frames of a speaker, VDTTS generates speech with prosody that matches the video signal.

VDTTS Model

The VDTTS model resembles Tacotron at its core and has four main components: (1) text and video encoders that process the inputs; (2) a multi-source attention mechanism that connects the encoders to a decoder; (3) a spectrogram decoder that incorporates the speaker embedding (similarly to VoiceFilter) and produces mel-spectrograms (which are a form of compressed representation in the frequency domain); and (4) a frozen, pretrained neural vocoder that produces waveforms from the mel-spectrograms.
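To make the data flow through these four components concrete, here is a minimal NumPy sketch of one decoder step with multi-source attention. Everything here is illustrative: the dimensions, the random linear maps standing in for the learned networks, and the function names are assumptions, not the actual VDTTS implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions for illustration only; the post does not give the real
# VDTTS hyperparameters, and random linear maps stand in for the learned
# encoders, attention, and decoder to show the data flow.
TEXT_DIM, VID_DIM, SPK_DIM, DEC_DIM, MEL_BINS = 16, 24, 8, 32, 80
CTX_DIM = TEXT_DIM + VID_DIM + SPK_DIM

W = {
    "q_text": rng.normal(size=(TEXT_DIM, DEC_DIM)) * 0.1,
    "q_vid":  rng.normal(size=(VID_DIM, DEC_DIM)) * 0.1,
    "state":  rng.normal(size=(DEC_DIM, CTX_DIM + DEC_DIM)) * 0.1,
    "mel":    rng.normal(size=(MEL_BINS, DEC_DIM)) * 0.1,
}

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, memory):
    # Dot-product attention of one decoder query over an encoder memory.
    return softmax(memory @ query) @ memory

def decoder_step(text_mem, video_mem, spk_emb, state):
    # Multi-source attention: separate contexts from the text and video
    # encoders, concatenated with the speaker embedding before decoding.
    ctx = np.concatenate([
        attend(W["q_text"] @ state, text_mem),
        attend(W["q_vid"] @ state, video_mem),
        spk_emb,
    ])
    state = np.tanh(W["state"] @ np.concatenate([ctx, state]))
    return W["mel"] @ state, state   # one mel-spectrogram frame

# Stand-ins for encoder outputs: 12 text tokens and 30 video frames.
text_mem = rng.normal(size=(12, TEXT_DIM))
video_mem = rng.normal(size=(30, VID_DIM))
spk_emb = rng.normal(size=(SPK_DIM,))

state = np.zeros(DEC_DIM)
mels = []
for _ in range(50):                  # 50 decoder steps -> 50 mel frames
    frame, state = decoder_step(text_mem, video_mem, spk_emb, state)
    mels.append(frame)
mel_spectrogram = np.stack(mels)     # a frozen vocoder would map this to audio
print(mel_spectrogram.shape)         # (50, 80)
```

The point of the sketch is the wiring: each decoder step queries both encoder memories independently, so the model can draw word identity from the text and timing/prosody cues from the video, and the resulting mel-spectrogram is handed to a separate, frozen vocoder.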

The overall architecture of VDTTS. Text and video encoders process the inputs, and a multi-source attention mechanism connects these to a decoder that produces mel-spectrograms. A vocoder then produces waveforms from the mel-spectrograms to generate speech as an output.

We train VDTTS using video and text pairs from LSVSR in which the text corresponds to the exact words spoken by a person in a video. Throughout our testing, we have determined that VDTTS cannot generate arbitrary text, making it less prone to misuse (e.g., the generation of fake content).

Quality

To showcase the unique strength of VDTTS in this post, we have selected two inference examples from the VoxCeleb2 test dataset and compare the performance of VDTTS to a standard text-to-speech (TTS) model. In both examples, the video frames provide prosody and word timing clues, visual information that is not available to the TTS model.

In the first example, the speaker talks at a particular pace that can be seen as periodic gaps in the ground-truth mel-spectrogram (shown below). VDTTS preserves this characteristic and generates audio that is much closer to the ground truth than the audio generated by standard TTS without access to the video.

Similarly, in the second example, the speaker takes long pauses between some of the words. These pauses are captured by VDTTS and are reflected in the video below, whereas the TTS does not capture this aspect of the speaker’s rhythm.

We also plot fundamental frequency (F0) charts to compare the pitch generated by each model to the ground-truth pitch. In both examples, the F0 curve of VDTTS matches the ground truth much better than the TTS curve, both in the alignment of speech and silence and in how the pitch changes over time. See more original videos and VDTTS-generated videos.

We present two examples, (a) and (b), from the VoxCeleb2 test set. From top to bottom: input face images, ground-truth (GT) mel-spectrogram, mel-spectrogram output of VDTTS, mel-spectrogram output of a standard TTS model, and two plots showing the normalized F0 (normalized by mean non-zero pitch, i.e., the mean is taken only over voiced periods) of VDTTS and TTS compared to the ground-truth signal.
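The normalization described in the caption, dividing a pitch track by its mean over voiced frames only, can be sketched in a few lines. The function name and the convention that unvoiced frames carry F0 = 0 are assumptions for illustration.

```python
import numpy as np

def normalize_f0(f0):
    """Normalize a pitch track by its mean non-zero value, i.e. the mean
    taken only over voiced frames (unvoiced frames carry F0 = 0 here)."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        return f0
    return f0 / f0[voiced].mean()

# Voiced mean is (200 + 220 + 180) / 3 = 200 Hz, so voiced frames map to
# 1.0, 1.1, and 0.9 while the unvoiced (zero) frames stay at 0.
f0 = np.array([0.0, 200.0, 220.0, 0.0, 180.0])
print(normalize_f0(f0))
```

Normalizing this way removes each speaker's absolute pitch level, so the plots compare the shape of the pitch contour rather than whether a model happened to match the speaker's average pitch.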

Video Samples

Original  VDTTS  VDTTS video-only  TTS
Original displays the original video clip. VDTTS displays the audio predicted using both the video frames and the text as input. VDTTS video-only displays audio predictions using video frames only. TTS displays audio predictions using text only. Top transcript: “of space for people to make their own judgments and to come to their own”. Bottom transcript: “absolutely love dancing I have no dance experience whatsoever but as that”.

Model Performance

We measured the VDTTS model’s performance on the VoxCeleb2 dataset and compared it to the TTS and TTS with length hint (a TTS that receives the scene length) models. We demonstrate that VDTTS outperforms both models by large margins in most of the aspects we measured: higher sync-to-video quality, as measured by SyncNet Distance; better speech quality, as measured by mel cepstral distance (MCD); and lower Gross Pitch Error (GPE), which measures the percentage of frames where the pitch differed by more than 20%, counted only over frames in which voice was present in both the predicted and the reference audio.
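The GPE definition above translates directly into code. This is a sketch of the metric as the post describes it; the function name and the zero-means-unvoiced convention are assumptions.

```python
import numpy as np

def gross_pitch_error(f0_pred, f0_ref, tol=0.2):
    """Gross Pitch Error: fraction of frames, voiced in BOTH the predicted
    and reference pitch tracks, where the pitch differs by more than
    `tol` (20%) relative to the reference. Unvoiced frames carry F0 = 0."""
    f0_pred = np.asarray(f0_pred, dtype=float)
    f0_ref = np.asarray(f0_ref, dtype=float)
    both_voiced = (f0_pred > 0) & (f0_ref > 0)
    if not both_voiced.any():
        return 0.0
    rel_err = np.abs(f0_pred[both_voiced] - f0_ref[both_voiced]) / f0_ref[both_voiced]
    return float((rel_err > tol).mean())

ref  = [0.0, 100.0, 100.0, 100.0, 100.0]
pred = [0.0, 100.0, 130.0,  90.0,   0.0]  # frame 2 is off by 30% (> 20%);
                                          # frame 4 is unvoiced in the prediction
print(gross_pitch_error(pred, ref))       # 1 error out of 3 commonly-voiced frames
```

Restricting the comparison to commonly-voiced frames keeps voicing mistakes from contaminating the pitch-accuracy measurement; those show up instead in how well speech and silence are aligned.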

SyncNet distance comparison between VDTTS, TTS, and TTS with length hint (lower is better).
Mel cepstral distance comparison between VDTTS, TTS, and TTS with length hint (lower is better).
Gross Pitch Error comparison between VDTTS, TTS, and TTS with length hint (lower is better).
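The post names MCD without defining it, so for completeness here is a sketch of the commonly-used dB formulation, averaged over time-aligned frames of mel-cepstral coefficients. The constant, the coefficient range, and the assumption that the two sequences are already aligned all follow the standard definition rather than anything stated in the post.

```python
import numpy as np

def mel_cepstral_distance(mcep_a, mcep_b):
    """Mel cepstral distance in dB between two aligned sequences of
    mel-cepstral coefficient vectors, shape (frames, coeffs). Assumes the
    standard formulation: (10 / ln 10) * sqrt(2 * sum of squared diffs),
    averaged over frames. Energy (0th) coefficient handling and time
    alignment are left to the caller."""
    mcep_a = np.asarray(mcep_a, dtype=float)
    mcep_b = np.asarray(mcep_b, dtype=float)
    diff = mcep_a - mcep_b
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * (diff ** 2).sum(axis=1))
    return float(per_frame.mean())

# Two toy 2-frame, 2-coefficient sequences: one frame differs by a unit
# vector, the other matches exactly.
a = np.array([[1.0, 0.0], [0.0, 0.0]])
b = np.zeros((2, 2))
print(mel_cepstral_distance(a, b))
```

Lower MCD means the predicted spectra are closer to the reference, which is why it serves as the speech-quality metric alongside SyncNet distance and GPE above.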

Discussion and Future Work

One thing to note is that, intriguingly, VDTTS produces video-synchronized speech without any explicit losses or constraints to promote this, suggesting that such complexities as synchronization losses or explicit modeling are unnecessary.

While this is a proof-of-concept demonstration, we believe that in the future VDTTS can be extended to scenarios where the input text differs from the original video signal. A model of this kind could be a valuable tool for tasks such as translation dubbing.


Acknowledgements

We would like to thank the co-authors of this research: Michelle Tadmor Ramanovich, Ye Jia, Brendan Shillingford, and Miaosen Wang. We are also grateful for the valued contributions, discussions, and feedback from Nadav Bar, Jay Tenenbaum, Zach Gleicher, Paul McCartney, Marco Tagliasacchi, and Yoni Tzafir.

