An End-to-End Neural Audio Codec


Audio codecs are used to efficiently compress audio in order to reduce either storage requirements or network bandwidth. Ideally, audio codecs should be transparent to the end user, so that the decoded audio is perceptually indistinguishable from the original and the encoding/decoding process does not introduce perceivable latency.

Over the past few years, different audio codecs have been successfully developed to meet these requirements, including Opus and Enhanced Voice Services (EVS). Opus is a versatile speech and audio codec, supporting bitrates from 6 kbps (kilobits per second) to 510 kbps, which has been widely deployed across applications ranging from video conferencing platforms, like Google Meet, to streaming services, like YouTube. EVS is the latest codec developed by the 3GPP standardization body, targeting mobile telephony. Like Opus, it is a versatile codec operating at multiple bitrates, 5.9 kbps to 128 kbps. The quality of the reconstructed audio using either of these codecs is excellent at medium-to-low bitrates (12–20 kbps), but it degrades sharply when operating at very low bitrates (⪅3 kbps). While these codecs leverage expert knowledge of human perception as well as carefully engineered signal processing pipelines to maximize the efficiency of the compression algorithms, there has been recent interest in replacing these handcrafted pipelines with machine learning approaches that learn to encode audio in a data-driven manner.

Earlier this year, we released Lyra, a neural audio codec for low-bitrate speech. In “SoundStream: An End-to-End Neural Audio Codec”, we introduce a novel neural audio codec that extends those efforts by providing higher-quality audio and expanding to encode different sound types, including clean speech, noisy and reverberant speech, music, and environmental sounds. SoundStream is the first neural network codec to work on speech and music, while being able to run in real time on a smartphone CPU. It is able to deliver state-of-the-art quality over a broad range of bitrates with a single trained model, which represents a significant advance in learnable codecs.

Learning an Audio Codec from Data
The main technical ingredient of SoundStream is a neural network, consisting of an encoder, decoder and quantizer, all of which are trained end-to-end. The encoder converts the input audio stream into a coded signal, which is compressed using the quantizer and then converted back to audio using the decoder. SoundStream leverages state-of-the-art solutions in the field of neural audio synthesis to deliver audio at high perceptual quality, by training a discriminator that computes a combination of adversarial and reconstruction loss functions that induce the reconstructed audio to sound like the uncompressed original input. Once trained, the encoder and decoder can be run on separate clients to efficiently transmit high-quality audio over a network.
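The data flow through the encoder, quantizer and decoder can be made concrete with a toy numpy sketch. Everything here is illustrative: real SoundStream uses convolutional networks and a learned residual vector quantizer, while this stand-in uses random linear maps and rounding simply to show the pipeline shape.

```python
import numpy as np

# Toy stand-in for the encoder/quantizer/decoder chain described above.
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(16, 160)) * 0.1   # 160 audio samples -> 16-d embedding
W_dec = rng.normal(size=(160, 16)) * 0.1   # 16-d embedding -> 160 samples

def encoder(audio_frame):
    return W_enc @ audio_frame

def quantizer(embedding):
    # Placeholder: snap to a coarse grid (SoundStream uses a learned RVQ).
    return np.round(embedding * 4) / 4

def decoder(embedding):
    return W_dec @ embedding

frame = rng.normal(size=160)
reconstruction = decoder(quantizer(encoder(frame)))

# The reconstruction loss pulls the output toward the input; during
# training it is combined with an adversarial loss from a discriminator.
recon_loss = float(np.mean((frame - reconstruction) ** 2))
```

At inference time only the encoder+quantizer run on the transmitter and only the decoder on the receiver, so what crosses the network is the quantized code, not the audio itself.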

SoundStream training and inference. During training, the encoder, quantizer and decoder parameters are optimized using a combination of reconstruction and adversarial losses, computed by a discriminator, which is trained to distinguish between the original input audio and the reconstructed audio. During inference, the encoder and quantizer on a transmitter client send the compressed bitstream to a receiver client that can then decode the audio signal.

Learning a Scalable Codec with Residual Vector Quantization
The encoder of SoundStream produces vectors that can take an indefinite number of values. In order to transmit them to the receiver using a limited number of bits, it is necessary to replace them with close vectors from a finite set (called a codebook), a process known as vector quantization. This approach works well at bitrates around 1 kbps or lower, but quickly reaches its limits when using higher bitrates. For example, even at a bitrate as low as 3 kbps, and assuming the encoder produces 100 vectors per second, one would need to store a codebook with more than 1 billion vectors, which is infeasible in practice.
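A minimal sketch of plain vector quantization, plus the arithmetic behind the 1-billion-vector claim. The `quantize` helper is illustrative, not SoundStream's actual API: it just shows that only a codebook *index* needs to be transmitted.

```python
import numpy as np

def quantize(vector, codebook):
    """Replace a vector by its nearest neighbor in the codebook; the
    index of that neighbor is what actually gets transmitted."""
    distances = np.linalg.norm(codebook - vector, axis=1)
    return int(np.argmin(distances))

# Why a single codebook breaks down at 3 kbps:
bitrate = 3000           # bits per second
frames_per_second = 100  # encoder output vectors per second
bits_per_vector = bitrate // frames_per_second  # 30 bits per vector
codebook_entries = 2 ** bits_per_vector         # entries needed to use all 30 bits
print(codebook_entries)  # 1073741824, i.e. more than 1 billion vectors
```

Thirty bits per vector means the codebook must distinguish 2^30 ≈ 1.07 billion entries, which is what makes a single-stage quantizer impractical at this bitrate.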

In SoundStream, we address this issue by proposing a new residual vector quantizer (RVQ), consisting of several layers (up to 80 in our experiments). The first layer quantizes the code vectors with moderate resolution, and each of the following layers processes the residual error from the previous one. By splitting the quantization process into several layers, the codebook size can be reduced drastically. For example, with 100 vectors per second at 3 kbps, and using 5 quantizer layers, the codebook size goes from 1 billion to 320. Moreover, we can easily increase or decrease the bitrate by adding or removing quantizer layers, respectively.
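The residual scheme can be sketched in a few lines of numpy. The codebooks below are hand-picked toys; in SoundStream the codebooks are learned end-to-end together with the rest of the network.

```python
import numpy as np

def rvq_encode(vector, codebooks):
    """Residual vector quantization: each layer quantizes what the
    previous layers left unexplained, emitting one index per layer."""
    indices, residual = [], np.asarray(vector, dtype=float)
    for codebook in codebooks:
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        indices.append(idx)
        residual = residual - codebook[idx]  # pass the error downstream
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruction is simply the sum of the selected entries."""
    return sum(codebook[idx] for idx, codebook in zip(indices, codebooks))

# The codebook-size arithmetic from the text: 3 kbps at 100 vectors/s
# gives 30 bits per vector. One quantizer needs 2**30 entries, while
# 5 layers of 6 bits each need only 5 * 2**6 = 320 entries in total.
assert 5 * 2 ** (30 // 5) == 320
```

Because each layer refines the previous layers' output, dropping the last layers still yields a valid (coarser) reconstruction, which is exactly the property the bitrate scalability below relies on.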

Because network conditions can vary while transmitting audio, ideally a codec should be “scalable” so that it can change its bitrate from low to high depending on the state of the network. While most traditional codecs are scalable, previous learnable codecs need to be trained and deployed specifically for each bitrate.

To circumvent this limitation, we leverage the fact that the number of quantization layers in SoundStream controls the bitrate, and propose a new method called “quantizer dropout”. During training, we randomly drop some quantization layers to simulate a varying bitrate. This pushes the decoder to perform well at any bitrate of the incoming audio stream, and thus helps SoundStream become “scalable”, so that a single trained model can operate at any bitrate, performing as well as models trained specifically for those bitrates.
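A sketch of the training-time sampling behind quantizer dropout. The layer count, bits per layer and frame rate below are illustrative values consistent with the numbers in this post, not the exact training configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS = 24           # illustrative; sets the maximum bitrate
BITS_PER_LAYER = 6        # 64-entry codebook per layer
FRAMES_PER_SECOND = 100

def sample_active_layers(rng):
    """Quantizer dropout: for each training example, keep only a prefix
    of the quantizer layers, with the prefix length drawn uniformly.
    The decoder therefore sees every operating bitrate during training."""
    return int(rng.integers(1, NUM_LAYERS + 1))

def bitrate_bps(n_active):
    """Bitrate simulated when only n_active layers are kept."""
    return n_active * BITS_PER_LAYER * FRAMES_PER_SECOND

for step in range(3):
    n_q = sample_active_layers(rng)
    # ... run the encoder, apply only the first n_q quantizer layers,
    # decode, and backpropagate the reconstruction/adversarial losses ...
    print(n_q, bitrate_bps(n_q))
```

Keeping a *prefix* of layers (rather than an arbitrary subset) is what lets the deployed model change bitrate at inference time simply by truncating the bitstream.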

Comparison of SoundStream models (higher is better) that are trained at 18 kbps with quantizer dropout (bitrate scalable), without quantizer dropout (not bitrate scalable) and evaluated with a variable number of quantizers, or trained and evaluated at a fixed bitrate (bitrate specific). The bitrate-scalable model (a single model for all bitrates) does not lose any quality when compared to bitrate-specific models (a different model for each bitrate), thanks to quantizer dropout.

A State-of-the-Art Audio Codec
SoundStream at 3 kbps outperforms Opus at 12 kbps and approaches the quality of EVS at 9.6 kbps, while using 3.2x–4x fewer bits. This means that encoding audio with SoundStream can provide a similar quality while using a significantly lower amount of bandwidth. Moreover, at the same bitrate, SoundStream outperforms the current version of Lyra, which is based on an autoregressive network. Unlike Lyra, which is already deployed and optimized for production usage, SoundStream is still at an experimental stage. In the future, Lyra will incorporate the components of SoundStream to provide both higher audio quality and reduced complexity.

SoundStream at 3 kbps vs. state-of-the-art codecs. The MUSHRA score is an indication of subjective quality (the higher the better).

A demonstration of SoundStream’s performance compared to Opus, EVS, and the original Lyra codec is presented in these audio examples, some of which are provided below.


Lyra (3kbps)
Opus (6kbps)
EVS (5.9kbps)
SoundStream (3kbps)  


Joint Audio Compression and Enhancement
In traditional audio processing pipelines, compression and enhancement (the removal of background noise) are typically performed by different modules. For example, it is possible to apply an audio enhancement algorithm on the transmitter side, before audio is compressed, or on the receiver side, after audio is decoded. In such a setup, each processing step contributes to the end-to-end latency. Conversely, we design SoundStream in such a way that compression and enhancement can be carried out jointly by the same model, without increasing the overall latency. In the following examples, we show that it is possible to combine compression with background noise suppression, by activating and deactivating denoising dynamically (no denoising for 5 seconds, denoising for 5 seconds, no denoising for 5 seconds, etc.).
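The on/off toggling in the demo amounts to a time-varying conditioning signal fed to the model alongside the audio. A minimal sketch of that schedule (the function name and 5-second period are taken from the demo description; how the flag is consumed inside the network is not shown here):

```python
def denoise_flag(t_seconds, period=5.0):
    """Conditioning schedule for the demo: denoising is off for the
    first `period` seconds, on for the next, and so on. The same codec
    model reads this flag per frame and either reproduces or suppresses
    the background noise."""
    return int(t_seconds // period) % 2 == 1
```

Because the flag is just an extra input to the codec, toggling it adds no extra processing stage, which is why enhancement comes at no additional latency.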

Original noisy audio  
Denoised output*
* Demonstrated by turning denoising on and off every 5 seconds.

Efficient compression is necessary whenever one needs to transmit audio, whether when streaming a video or during a conference call. SoundStream is an important step towards improving machine-learning-driven audio codecs. It outperforms state-of-the-art codecs, such as Opus and EVS, can enhance audio on demand, and requires deployment of only a single scalable model, rather than many.

SoundStream will be released as part of the next, improved version of Lyra. By integrating SoundStream with Lyra, developers can leverage the existing Lyra APIs and tools for their work, providing both flexibility and better sound quality. We will also release it as a separate TensorFlow model for experimentation.

The work described here was authored by Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund and Marco Tagliasacchi. We are grateful for all the discussions and feedback on this work that we received from our colleagues at Google.
