Google AI Blog: VDTTS: Visually-Driven Text-To-Speech


Recent years have seen a tremendous increase in the creation and serving of video content to users around the world, in a variety of languages and over numerous platforms. The process of creating high-quality content can include several stages, from video capturing and captioning to video and audio editing. In some cases, dialogue is re-recorded in a studio (referred to as dialog replacement, post-sync, or dubbing) in order to achieve high quality and to replace original audio that might have been recorded in noisy conditions. However, the dialog replacement process can be difficult and tedious because the newly recorded audio needs to be well synced with the video, often requiring several edits to match the exact timing of mouth movements.

In “More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech”, we present a proof-of-concept visually-driven text-to-speech model, called VDTTS, that automates the dialog replacement process. Given a text and the original video frames of the speaker, VDTTS is trained to generate the corresponding speech. In contrast to standard visual speech recognition models, which focus on the mouth region, we detect and crop full faces using MediaPipe to avoid potentially excluding information pertinent to the speaker’s delivery. This gives the VDTTS model enough information to generate speech that matches the video while also recovering aspects of prosody, such as timing and emotion. Despite not being explicitly trained to generate speech that is synchronized to the input video, the learned model still does so.

Given a text and video frames of a speaker, VDTTS generates speech with prosody that matches the video signal.

VDTTS Model

The VDTTS model resembles Tacotron at its core and has four main components: (1) text and video encoders that process the inputs; (2) a multi-source attention mechanism that connects the encoders to a decoder; (3) a spectrogram decoder that incorporates the speaker embedding (similarly to VoiceFilter) and produces mel-spectrograms (which are a form of compressed representation in the frequency domain); and (4) a frozen, pretrained neural vocoder that produces waveforms from the mel-spectrograms.
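To make the data flow through these four components concrete, here is a minimal NumPy sketch of one decoder step with multi-source attention. Everything here is illustrative: the dimensions, the random linear maps standing in for the learned networks, and the function names are assumptions, not the actual VDTTS implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions for illustration only; the post does not give the real
# VDTTS hyperparameters, and random linear maps stand in for the learned
# encoders, attention, and decoder to show the data flow.
TEXT_DIM, VID_DIM, SPK_DIM, DEC_DIM, MEL_BINS = 16, 24, 8, 32, 80
CTX_DIM = TEXT_DIM + VID_DIM + SPK_DIM

W = {
    "q_text": rng.normal(size=(TEXT_DIM, DEC_DIM)) * 0.1,
    "q_vid":  rng.normal(size=(VID_DIM, DEC_DIM)) * 0.1,
    "state":  rng.normal(size=(DEC_DIM, CTX_DIM + DEC_DIM)) * 0.1,
    "mel":    rng.normal(size=(MEL_BINS, DEC_DIM)) * 0.1,
}

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, memory):
    # Dot-product attention of one decoder query over an encoder memory.
    return softmax(memory @ query) @ memory

def decoder_step(text_mem, video_mem, spk_emb, state):
    # Multi-source attention: separate contexts from the text and video
    # encoders, concatenated with the speaker embedding before decoding.
    ctx = np.concatenate([
        attend(W["q_text"] @ state, text_mem),
        attend(W["q_vid"] @ state, video_mem),
        spk_emb,
    ])
    state = np.tanh(W["state"] @ np.concatenate([ctx, state]))
    return W["mel"] @ state, state   # one mel-spectrogram frame

# Stand-ins for encoder outputs: 12 text tokens and 30 video frames.
text_mem = rng.normal(size=(12, TEXT_DIM))
video_mem = rng.normal(size=(30, VID_DIM))
spk_emb = rng.normal(size=(SPK_DIM,))

state = np.zeros(DEC_DIM)
mels = []
for _ in range(50):                  # 50 decoder steps -> 50 mel frames
    frame, state = decoder_step(text_mem, video_mem, spk_emb, state)
    mels.append(frame)
mel_spectrogram = np.stack(mels)     # a frozen vocoder would map this to audio
print(mel_spectrogram.shape)         # (50, 80)
```

The point of the sketch is the wiring: each decoder step queries both encoder memories independently, so the model can draw word identity from the text and timing/prosody cues from the video, and the resulting mel-spectrogram is handed to a separate, frozen vocoder.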

The overall architecture of VDTTS. Text and video encoders process the inputs, and a multi-source attention mechanism connects these to a decoder that produces mel-spectrograms. A vocoder then produces waveforms from the mel-spectrograms to generate speech as an output.

We train VDTTS using video and text pairs from LSVSR in which the text corresponds to the exact words spoken by a person in a video. Throughout our testing, we have determined that VDTTS cannot generate arbitrary text, making it less prone to misuse (e.g., the generation of fake content).

Quality

To showcase the unique strength of VDTTS in this post, we have selected two inference examples from the VoxCeleb2 test dataset and compare the performance of VDTTS to a standard text-to-speech (TTS) model. In both examples, the video frames provide prosody and word timing clues, visual information that is not available to the TTS model.

In the first example, the speaker talks at a particular pace that can be seen as periodic gaps in the ground-truth mel-spectrogram (shown below). VDTTS preserves this characteristic and generates audio that is much closer to the ground truth than the audio generated by standard TTS without access to the video.

Similarly, in the second example, the speaker takes long pauses between some of the words. These pauses are captured by VDTTS and are reflected in the video below, whereas the TTS does not capture this aspect of the speaker’s rhythm.

We also plot fundamental frequency (F0) charts to compare the pitch generated by each model to the ground-truth pitch. In both examples, the F0 curve of VDTTS matches the ground truth much better than the TTS curve, both in the alignment of speech and silence and in how the pitch changes over time. See more original videos and VDTTS-generated videos.

We present two examples, (a) and (b), from the VoxCeleb2 test set. From top to bottom: input face images, ground-truth (GT) mel-spectrogram, mel-spectrogram output of VDTTS, mel-spectrogram output of a standard TTS model, and two plots showing the normalized F0 (normalized by mean non-zero pitch, i.e., the mean is taken only over voiced periods) of VDTTS and TTS compared to the ground-truth signal.
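The normalization described in the caption, dividing a pitch track by its mean over voiced frames only, can be sketched in a few lines. The function name and the convention that unvoiced frames carry F0 = 0 are assumptions for illustration.

```python
import numpy as np

def normalize_f0(f0):
    """Normalize a pitch track by its mean non-zero value, i.e. the mean
    taken only over voiced frames (unvoiced frames carry F0 = 0 here)."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        return f0
    return f0 / f0[voiced].mean()

# Voiced mean is (200 + 220 + 180) / 3 = 200 Hz, so voiced frames map to
# 1.0, 1.1, and 0.9 while the unvoiced (zero) frames stay at 0.
f0 = np.array([0.0, 200.0, 220.0, 0.0, 180.0])
print(normalize_f0(f0))
```

Normalizing this way removes each speaker's absolute pitch level, so the plots compare the shape of the pitch contour rather than whether a model happened to match the speaker's average pitch.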

Video Samples

Original  VDTTS  VDTTS video-only  TTS
Original displays the original video clip. VDTTS displays the audio predicted using both the video frames and the text as input. VDTTS video-only displays audio predictions using video frames only. TTS displays audio predictions using text only. Top transcript: “of space for people to make their own judgments and to come to their own”. Bottom transcript: “absolutely love dancing I have no dance experience whatsoever but as that”.

Model Performance

We measured the VDTTS model’s performance on the VoxCeleb2 dataset and compared it to the TTS and TTS with length hint (a TTS that receives the scene length) models. We demonstrate that VDTTS outperforms both models by large margins in most of the aspects we measured: higher sync-to-video quality, as measured by SyncNet Distance; better speech quality, as measured by mel cepstral distance (MCD); and lower Gross Pitch Error (GPE), which measures the percentage of frames where the pitch differed by more than 20%, counted only over frames in which voice was present in both the predicted and the reference audio.
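The GPE definition above translates directly into code. This is a sketch of the metric as the post describes it; the function name and the zero-means-unvoiced convention are assumptions.

```python
import numpy as np

def gross_pitch_error(f0_pred, f0_ref, tol=0.2):
    """Gross Pitch Error: fraction of frames, voiced in BOTH the predicted
    and reference pitch tracks, where the pitch differs by more than
    `tol` (20%) relative to the reference. Unvoiced frames carry F0 = 0."""
    f0_pred = np.asarray(f0_pred, dtype=float)
    f0_ref = np.asarray(f0_ref, dtype=float)
    both_voiced = (f0_pred > 0) & (f0_ref > 0)
    if not both_voiced.any():
        return 0.0
    rel_err = np.abs(f0_pred[both_voiced] - f0_ref[both_voiced]) / f0_ref[both_voiced]
    return float((rel_err > tol).mean())

ref  = [0.0, 100.0, 100.0, 100.0, 100.0]
pred = [0.0, 100.0, 130.0,  90.0,   0.0]  # frame 2 is off by 30% (> 20%);
                                          # frame 4 is unvoiced in the prediction
print(gross_pitch_error(pred, ref))       # 1 error out of 3 commonly-voiced frames
```

Restricting the comparison to commonly-voiced frames keeps voicing mistakes from contaminating the pitch-accuracy measurement; those show up instead in how well speech and silence are aligned.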

SyncNet distance comparison between VDTTS, TTS, and TTS with length hint (lower is better).
Mel cepstral distance comparison between VDTTS, TTS, and TTS with length hint (lower is better).
Gross Pitch Error comparison between VDTTS, TTS, and TTS with length hint (lower is better).
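The post names MCD without defining it, so for completeness here is a sketch of the commonly-used dB formulation, averaged over time-aligned frames of mel-cepstral coefficients. The constant, the coefficient range, and the assumption that the two sequences are already aligned all follow the standard definition rather than anything stated in the post.

```python
import numpy as np

def mel_cepstral_distance(mcep_a, mcep_b):
    """Mel cepstral distance in dB between two aligned sequences of
    mel-cepstral coefficient vectors, shape (frames, coeffs). Assumes the
    standard formulation: (10 / ln 10) * sqrt(2 * sum of squared diffs),
    averaged over frames. Energy (0th) coefficient handling and time
    alignment are left to the caller."""
    mcep_a = np.asarray(mcep_a, dtype=float)
    mcep_b = np.asarray(mcep_b, dtype=float)
    diff = mcep_a - mcep_b
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * (diff ** 2).sum(axis=1))
    return float(per_frame.mean())

# Two toy 2-frame, 2-coefficient sequences: one frame differs by a unit
# vector, the other matches exactly.
a = np.array([[1.0, 0.0], [0.0, 0.0]])
b = np.zeros((2, 2))
print(mel_cepstral_distance(a, b))
```

Lower MCD means the predicted spectra are closer to the reference, which is why it serves as the speech-quality metric alongside SyncNet distance and GPE above.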

Discussion and Future Work

One thing to note is that, intriguingly, VDTTS produces video-synchronized speech without any explicit losses or constraints to promote this, suggesting that such complexities as synchronization losses or explicit modeling are unnecessary.

While this is a proof-of-concept demonstration, we believe that in the future VDTTS can be extended to scenarios where the input text differs from the original video signal. A model of this kind could be a valuable tool for tasks such as translation dubbing.


Acknowledgements

We would like to thank the co-authors of this research: Michelle Tadmor Ramanovich, Ye Jia, Brendan Shillingford, and Miaosen Wang. We are also grateful for the valued contributions, discussions, and feedback from Nadav Bar, Jay Tenenbaum, Zach Gleicher, Paul McCartney, Marco Tagliasacchi, and Yoni Tzafir.

