Small, Universal Speech Representations for Paralinguistic Tasks


In recent years, we have seen dramatic improvements on lexical tasks such as automatic speech recognition (ASR). However, machine systems still struggle to understand paralinguistic aspects such as tone, emotion, or whether a speaker is wearing a mask. Understanding these aspects is one of the remaining difficult problems in machine hearing. In addition, state-of-the-art results often come from ultra-large models trained on private data, making them impractical to run on mobile devices or to release publicly.

In “Universal Paralinguistic Speech Representations Using Self-Supervised Conformers”, to appear in ICASSP 2022, we introduce CAP12, the 12th layer of a 600M parameter model trained on the YT-U training dataset using self-supervision. We demonstrate that the CAP12 model outperforms nearly all previous results in our paralinguistic benchmark, sometimes by large margins, even though previous results are often task-specific. In “TRILLsson: Distilled Universal Paralinguistic Speech Representations”, we introduce the small, performant, publicly-available TRILLsson models and demonstrate how we reduced the size of the high-performing CAP12 model by 6x-100x while maintaining 90-96% of the performance. To create TRILLsson, we apply knowledge distillation on appropriately-sized audio chunks and use different architecture types to train smaller, faster networks that are small enough to run on mobile devices.

1M-Hour Dataset to Train Ultra-Large Self-Supervised Models

We leverage the YT-U training dataset to train the ultra-large, self-supervised CAP12 model. The YT-U dataset is a highly varied, 900k+ hour dataset that contains audio of various topics, background conditions, and speaker acoustic properties.

Video categories by length (outer) and number (inner), demonstrating the variety in the YT-U dataset (figure from BigSSL)

We then modify a Wav2Vec 2.0 self-supervised training paradigm, which can solve tasks using raw data without labels, and combine it with ultra-large Conformer models. Because self-training doesn't require labels, we can take full advantage of YT-U by scaling up our models to some of the largest model sizes ever trained, including 600M, 1B, and 8B parameters.

NOSS: A Benchmark for Paralinguistic Tasks

We demonstrate that an intermediate representation of one of the previous models contains a state-of-the-art representation for paralinguistic speech. We call the 600M parameter Conformer model without relative attention Conformer Applied to Paralinguistics (CAP). We exhaustively search through all intermediate representations of six ultra-large models and find that layer 12 (CAP12) outperforms previous representations by significant margins.

To measure the quality of the roughly 300 candidate paralinguistic speech representations, we evaluate on an expanded version of the NOn-Semantic Speech (NOSS) benchmark, which is a collection of well-studied paralinguistic speech tasks, such as speech emotion recognition, language identification, and speaker identification. These tasks focus on paralinguistic aspects of speech, which require evaluating speech features on the order of 1 second or longer, rather than lexical features, which require 100 ms or shorter. We then add to the benchmark a mask-wearing task introduced at Interspeech 2020, a fake speech detection task (ASVSpoof 2019), a task to detect the level of dysarthria from project Euphonia, and an additional speech emotion recognition task (IEMOCAP). By expanding the benchmark and increasing the diversity of the tasks, we empirically demonstrate that CAP12 is even more generally useful than previous representations.

Simple linear models on time-averaged CAP12 representations even outperform complex, task-specific models on five out of eight paralinguistic tasks. This is surprising because comparable models sometimes use additional modalities (e.g., vision and speech, or text and speech) as well. Furthermore, CAP12 is exceptionally good at emotion recognition tasks. CAP12 embeddings also outperform all other embeddings on all other tasks with only a single exception: one embedding from a supervised network on the dysarthria detection task.
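To make the "linear model on time-averaged representations" setup concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the evaluation code behind the results: the frame embeddings, weights, and biases are made-up toy values, and `time_average` and `linear_predict` are hypothetical helper names.

```python
from typing import List, Sequence


def time_average(frames: Sequence[Sequence[float]]) -> List[float]:
    """Collapse a (num_frames x dim) sequence of frame embeddings into one
    fixed-size clip embedding by averaging over the time axis."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def linear_predict(
    embedding: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
) -> int:
    """Return the index of the highest-scoring class under a linear model,
    where score_c = dot(weights[c], embedding) + biases[c]."""
    scores = [
        sum(w * x for w, x in zip(class_weights, embedding)) + b
        for class_weights, b in zip(weights, biases)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy sequence of 3 frames with 2-dimensional embeddings (hypothetical values).
frames = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
clip_embedding = time_average(frames)

# Toy two-class linear model (hypothetical weights and biases).
weights = [[1.0, -1.0], [0.0, 1.0]]
biases = [0.0, 0.5]
predicted_class = linear_predict(clip_embedding, weights, biases)
```

Because the time average yields one vector per clip regardless of clip length, any off-the-shelf linear classifier can be trained on top of it; only the small linear layer is task-specific.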

Model                  Voxceleb   Voxforge   Speech Commands   ASVSpoof2019∗∗   Euphonia#   CREMA-D   IEMOCAP
Prev SoTA              -          95.4       97.9              5.11             45.9        74.0      67.6+
TRILL                  12.6       84.5       77.6              74.6             48.1        65.7      54.3
ASR Embedding          5.2        98.9       96.1              11.2             54.5        71.8      65.4
Wav2Vec2 layer 6††     17.9       98.5       95.0              6.7              48.2        77.4      65.8
CAP12                  51.0       99.7       97.0              2.5              51.5        88.2      75.0
Test performance on the NOSS Benchmark and extended tasks. “Prev SoTA” indicates the previous best performing state-of-the-art model, which has arbitrary complexity, but all other rows are linear models on time-averaged input. Filtered according to YouTube’s privacy guidelines. ∗∗ Uses equal error rate [20]. # The only non-public dataset. We exclude it from aggregate scores. Audio and visual features used in previous state-of-the-art models. + The previous state-of-the-art model performed cross-validation. For our evaluation, we hold out two specific speakers as a test. †† Wav2Vec 2.0 model from HuggingFace. Best overall layer was layer 6.
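The ASVSpoof2019 column reports equal error rate (EER), where lower is better. EER is the operating point at which the false acceptance rate equals the false rejection rate. A minimal sketch of one common way to approximate it is shown below. The function name `equal_error_rate`, the thresholding convention, and the toy scores are assumptions for illustration, not the benchmark's official scoring tool.

```python
from typing import Sequence


def equal_error_rate(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Approximate the EER by using every observed score as a threshold.

    labels: 1 for the target (positive) class, 0 for the non-target class.
    A trial is accepted when its score is >= the threshold.
    Returns a fraction in [0, 1]; multiply by 100 for a percentage.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative trial")

    best_gap, best_eer = None, None
    for threshold in sorted(set(scores)):
        # False acceptance rate: negatives that pass the threshold.
        far = sum(s >= threshold for s in negatives) / len(negatives)
        # False rejection rate: positives that fail the threshold.
        frr = sum(s < threshold for s in positives) / len(positives)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy trials (hypothetical scores): first two are positives, last two negatives.
perfect = equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
overlapping = equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

With perfectly separated scores the sketch returns 0, and the error grows as the positive and negative score distributions overlap.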

TRILLsson: Small, High Quality, Publicly Available Models

Similar to FRILL, our next step was to make an on-device, publicly available version of CAP12. This involved using knowledge distillation to train smaller, faster, mobile-friendly architectures. We experimented with EfficientNet, Audio Spectrogram Transformer (AST), and ResNet. These model types are very different, and cover both fixed-length and arbitrary-length inputs. EfficientNet comes from a neural architecture search over vision models to find simultaneously performant and efficient model structures. AST models are transformers adapted to audio inputs. ResNet is a standard architecture that has shown good performance across many different models.

We trained models that performed on average 90-96% as well as CAP12, despite being 1%-15% of the size and trained using only 6% of the data. Interestingly, we found that different architecture types performed better at different sizes. ResNet models performed best on the low end, EfficientNet in the middle, and AST models on the larger end.

Aggregate embedding performance vs. model size for various student model architectures and sizes. We demonstrate that ResNet architectures perform best for small sizes, EfficientNetV2 performs best in the midsize model range, up to the largest model size tested, after which the larger AST models are best.

We perform knowledge distillation with the goal of matching a student, with a fixed-size input, to the output of a teacher, with a variable-size input. There are two methods of generating student targets: global matching and local matching. Global matching produces distillation targets by generating CAP12 embeddings for an entire audio clip, and then requires that a student match the target from just a small segment of audio (e.g., 2 seconds). Local matching requires that the student network match the average CAP12 embedding just over the smaller portion of the audio that the student sees. In our work, we focused on local matching.

Two types of generating distillation targets for sequences. Left: Global matching uses the average CAP12 embedding over the whole clip for the target for each local chunk. Right: Local matching uses CAP12 embeddings averaged just over local clips as the distillation target.
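The difference between the two target types can be sketched in a few lines. This is a simplified illustration, not the training code: it assumes the teacher's frame-level embeddings are already available as a list, it slices them by frame index to stand in for "the portion of audio the student sees", and it uses mean squared error as an assumed distillation loss. All function names and numbers are hypothetical.

```python
from typing import List, Sequence

Frames = Sequence[Sequence[float]]


def mean_embedding(frames: Frames) -> List[float]:
    """Average frame-level teacher embeddings over time."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def global_target(teacher_frames: Frames) -> List[float]:
    """Global matching: every chunk shares the whole-clip average."""
    return mean_embedding(teacher_frames)


def local_target(teacher_frames: Frames, start: int, end: int) -> List[float]:
    """Local matching: average only over the frames the student sees."""
    return mean_embedding(teacher_frames[start:end])


def mse(student_output: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error, used here as an assumed distillation loss."""
    return sum((s - t) ** 2 for s, t in zip(student_output, target)) / len(target)


# Toy teacher embeddings: 4 frames, 2 dimensions (hypothetical values).
teacher_frames = [[0.0, 0.0], [2.0, 2.0], [4.0, 6.0], [6.0, 8.0]]

global_t = global_target(teacher_frames)      # same target for every chunk
local_t = local_target(teacher_frames, 0, 2)  # target for the first chunk only
loss = mse([1.0, 2.0], local_t)               # hypothetical student output
```

Under global matching the student is asked to predict clip-level information it may not be able to see in its short input, whereas the local target depends only on the audio the student actually receives.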

Observation of Bimodality and Future Directions

Paralinguistic information shows an unexpected bimodal distribution. For the CAP model that operates on 500 ms input segments, and two of the full-input Conformer models, intermediate representations gradually increase in paralinguistic information, then decrease, then increase again, and finally lose this information towards the output layer. Surprisingly, this pattern is also seen when exploring the intermediate representations of networks trained on retinal images.

500 ms inputs to CAP show a relatively pronounced bimodal distribution of paralinguistic information across layers.
Two of the Conformer models with full inputs show a bimodal distribution of paralinguistic information across layers.

We hope that smaller, faster models for paralinguistic speech unlock new applications in speech recognition, text-to-speech generation, and understanding user intent. We also expect that smaller models will be more easily interpretable, which will allow researchers to understand what aspects of speech are important for paralinguistics. Finally, we hope that our open-sourced speech representations are used by the community to improve paralinguistic speech tasks and user understanding in private or small datasets.


I'd like to thank my co-authors Aren Jansen, Wei Han, Daniel Park, Yu Zhang, and Subhashini Venugopalan for their hard work and creativity on this project. I'd also like to thank the members of the large collaboration for the BigSSL work, without which these projects would not be possible. The team includes James Qin, Anmol Gulati, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, and Yonghui Wu.

