A Massively Multilingual Speech-to-Speech Translation Corpus


Automatic translation of speech in one language to speech in another language, known as speech-to-speech translation (S2ST), is vital for breaking down communication barriers between people speaking different languages. Conventionally, automatic S2ST systems are built as a cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis sub-systems, so that the overall system is text-centric. Recently, work on S2ST that does not rely on an intermediate text representation has been emerging, such as end-to-end direct S2ST (e.g., Translatotron) and cascade S2ST based on learned discrete representations of speech (e.g., Tjandra et al.). While early versions of such direct S2ST systems obtained lower translation quality than cascade S2ST models, they are gaining traction because they have the potential both to reduce translation latency and compounding errors, and to better preserve paralinguistic and non-linguistic information from the original speech, such as voice, emotion, and tone. However, such models usually have to be trained on datasets with paired S2ST data, and the public availability of such corpora is extremely limited.

To foster research on this new generation of S2ST, we introduce CVSS, a Common Voice-based Speech-to-Speech translation corpus, which includes sentence-level speech-to-speech translation pairs from 21 languages into English. Unlike existing public corpora, CVSS can be directly used to train such direct S2ST models without any extra processing. In "CVSS Corpus and Massively Multilingual Speech-to-Speech Translation", we describe the dataset design and development, and demonstrate the effectiveness of the corpus by training baseline direct and cascade S2ST models, showing that the performance of a direct S2ST model approaches that of a cascade S2ST model.

Building CVSS
CVSS is directly derived from the CoVoST 2 speech-to-text (ST) translation corpus, which is in turn derived from the Common Voice speech corpus. Common Voice is a massively multilingual transcribed speech corpus designed for ASR, in which the speech is collected from contributors reading text content from Wikipedia and other text corpora. CoVoST 2 further provides professional text translations of the original transcripts from 21 languages into English and from English into 15 languages. CVSS builds on these efforts by providing sentence-level parallel speech-to-speech translation pairs from 21 languages into English (shown in the table below).

To facilitate research with different focuses, two versions of the translation speech in English are provided in CVSS, both synthesized using state-of-the-art TTS systems, with each version providing unique value that does not exist in other public S2ST corpora:

  • CVSS-C: All the translation speech is in a single canonical speaker's voice. Despite being synthetic, the speech is highly natural, clean, and consistent in speaking style. These properties ease the modeling of the target speech and enable trained models to produce high-quality translation speech suitable for general user-facing applications, where speech quality matters more than accurately reproducing the speakers' voices.
  • CVSS-T: The translation speech captures the voice from the corresponding source speech. Each S2ST pair has similar voices on the two sides, despite being in different languages. Because of this, the dataset is suitable for building models where accurate voice preservation is desired, such as for movie dubbing.

Together with the source speech, the two S2ST datasets contain 1,872 and 1,937 hours of speech, respectively.

Language      Code    Source speech (X) hours
French        fr      309.3
German        de      226.5
Catalan       ca      174.8
Spanish       es      157.6
Italian       it      73.9
Persian       fa      58.8
Russian       ru      38.7
Chinese       zh      26.5
Portuguese    pt      20.0
Dutch         nl      11.2
Estonian      et      9.0
Mongolian     mn      8.4
Turkish       tr      7.9
Arabic        ar      5.8
Latvian       lv      4.9
Swedish       sv      4.3
Welsh         cy      3.6
Tamil         ta      3.1
Indonesian    id      3.0
Japanese      ja      3.0
Slovenian     sl      2.9
Total                 1,153.2

Amount of source speech for each X→En pair in CVSS (hours).

In addition to the translation speech, CVSS also provides normalized translation text matching the pronunciation in the translation speech (for numbers, currencies, acronyms, etc.; see the data samples below, e.g., where "100%" is normalized as "a hundred percent" and "King George II" is normalized as "king george the second"), which can benefit both model training and standardized evaluation.

CVSS is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license and can be freely downloaded online.
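Once downloaded, pairing each source clip with its English translation text is straightforward. The sketch below assumes a hypothetical on-disk layout — a per-language directory containing a `train.tsv` of tab-separated (clip ID, translation text) rows next to a directory of translation `.wav` files; the actual release layout may differ, so adapt the paths accordingly.

```python
import csv
from pathlib import Path

def load_cvss_split(root, lang, split="train"):
    """Pair each clip ID with its English translation text and audio path.

    Assumes a hypothetical layout (not necessarily the released one):
      <root>/<lang>/<split>.tsv       tab-separated (clip_id, translation_text)
      <root>/<lang>/<split>/<clip_id>.wav   translation audio
    """
    pairs = []
    tsv_path = Path(root) / lang / f"{split}.tsv"
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for clip_id, text in csv.reader(f, delimiter="\t"):
            wav_path = Path(root) / lang / split / f"{clip_id}.wav"
            pairs.append({"clip_id": clip_id, "text": text, "audio_path": wav_path})
    return pairs
```

From here, the source-side audio comes from Common Voice (via CoVoST 2), keyed by the same clip IDs.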

Data Samples

Example 1:
Source audio (French)
Source transcript (French)    Le genre musical de la chanson est entièrement le disco.
CVSS-C translation audio (English)
CVSS-T translation audio (English)
Translation text (English)    The musical genre of the song is 100% Disco.
Normalized translation text (English)    the musical genre of the song is a hundred percent disco
Example 2:
Source audio (Chinese)
Source transcript (Chinese)    弗雷德里克王子,英国王室成员,为乔治二世之孙,乔治三世之幼弟。
CVSS-C translation audio (English)
CVSS-T translation audio (English)
Translation text (English)    Prince Frederick, member of British Royal Family, grandson of King George II, brother of King George III.
Normalized translation text (English)    prince frederick member of british royal family grandson of king george the second brother of king george the third
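The normalization shown above (spoken-form numbers, expanded ordinals, lowercasing, punctuation removal) was produced by a TTS-style text normalizer. The following is a minimal rule-based sketch covering only the two samples above — the rules here are hypothetical illustrations, not the pipeline used to build CVSS.

```python
import re

# Illustrative, hand-picked rules; a real normalizer covers far more cases
# (currencies, acronyms, dates, general number verbalization, ...).
ROMAN_ORDINALS = {"II": "the second", "III": "the third"}

def normalize(text):
    # Expand the regnal-number ordinals, e.g. "George II" -> "george the second".
    for roman, spoken in ROMAN_ORDINALS.items():
        text = re.sub(rf"\b{roman}\b", spoken, text)
    # Verbalize the percentage as it is pronounced in the translation speech.
    text = text.replace("100%", "a hundred percent")
    # Lowercase, drop punctuation, collapse whitespace.
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()
```

For instance, `normalize("The musical genre of the song is 100% Disco.")` yields `"the musical genre of the song is a hundred percent disco"`, matching the normalized text of Example 1.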

Baseline Models
On each version of CVSS, we trained a baseline cascade S2ST model as well as two baseline direct S2ST models and compared their performance. These baselines can be used for comparison in future research.

Cascade S2ST: To build strong cascade S2ST baselines, we trained an ST model on CoVoST 2, which outperforms the previous state of the art by +5.8 average BLEU on all 21 language pairs (detailed in the paper) when trained on the corpus without using extra data. This ST model is connected to the same TTS models used for constructing CVSS to compose very strong cascade S2ST baselines (ST → TTS).

Direct S2ST: We built two baseline direct S2ST models using Translatotron and Translatotron 2. When trained from scratch on CVSS, the translation quality of Translatotron 2 (8.7 BLEU) approaches that of the strong cascade S2ST baseline (10.6 BLEU). Moreover, when both use pre-training, the gap decreases to only 0.7 BLEU on ASR-transcribed translations. These results verify the effectiveness of using CVSS to train direct S2ST models.

Translation quality of baseline direct and cascade S2ST models built on CVSS-C, measured by BLEU on ASR transcriptions of the speech translations. The pre-training was done on CoVoST 2 without other extra data sets.
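This evaluation protocol — transcribe the translated speech with ASR, then score the transcript against the reference with BLEU — can be sketched with a minimal smoothed sentence-level BLEU in pure Python. The paper's numbers come from a standard corpus-level scorer (e.g., SacreBLEU), so treat this implementation as illustrative only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hypothesis, reference, max_n=4):
    """Add-one smoothed sentence-level BLEU on whitespace tokens.

    hypothesis: ASR transcript of the translated speech (normalized text).
    reference:  normalized reference translation text.
    """
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped matches
        total = max(sum(hyp_ngrams.values()), 1)
        # Add-one smoothing avoids log(0) for short or poor hypotheses.
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # Brevity penalty for hypotheses shorter than the reference.
    brevity = min(0.0, 1.0 - len(ref) / max(len(hyp), 1))
    return math.exp(brevity + sum(log_precisions) / max_n)
```

Comparing direct and cascade systems then amounts to running both outputs through the same ASR model and the same text normalization before scoring, so neither system is penalized for surface-form differences.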

Conclusion
We have released two versions of multilingual-to-English S2ST datasets, CVSS-C and CVSS-T, each with about 1.9K hours of sentence-level parallel S2ST pairs, covering 21 source languages. The translation speech in CVSS-C is in a single canonical speaker's voice, while that in CVSS-T is in voices transferred from the source speech. Each of these datasets provides unique value not found in other public S2ST corpora.

We built baseline multilingual direct S2ST models and cascade S2ST models on both datasets, which can be used for comparison in future works. To build strong cascade S2ST baselines, we trained an ST model on CoVoST 2, which outperforms the previous state of the art by +5.8 average BLEU when trained on the corpus without extra data. Nevertheless, the performance of the direct S2ST models approaches the strong cascade baselines when trained from scratch, with only a 0.7 BLEU difference on ASR-transcribed translations when pre-training is applied. We hope this work helps accelerate research on direct S2ST.

Acknowledgements
We acknowledge the volunteer contributors and the organizers of the Common Voice and LibriVox projects for their contribution and collection of recordings, and the creators of the Common Voice, CoVoST, CoVoST 2, LibriSpeech, and LibriTTS corpora for their previous work. The direct contributors to the CVSS corpus and the paper include Ye Jia, Michelle Tadmor Ramanovich, Quan Wang, and Heiga Zen. We also thank Ankur Bapna, Yiling Huang, Jason Pelecanos, Colin Cherry, Alexis Conneau, Yonghui Wu, Hadar Shemtov, and Françoise Beaufays for helpful discussions and support.
