In 2019 we launched Recorder, an audio recording app for Pixel phones that helps users create, manage, and edit audio recordings. It leverages recent developments in on-device machine learning to transcribe speech, recognize audio events, suggest tags for titles, and help users navigate transcripts.
However, some Recorder users found it difficult to navigate long recordings that have multiple speakers because it’s not clear who said what. During the Made By Google event this year, we announced the “speaker labels” feature for the Recorder app. This opt-in feature annotates a recording transcript with unique and anonymous labels for each speaker (e.g., “Speaker 1”, “Speaker 2”, etc.) in real time during the recording. It significantly improves the readability and usability of the recording transcripts. This feature is powered by Google’s new speaker diarization system named Turn-to-Diarize, which was first presented at ICASSP 2022.
|Left: Recorder transcript without speaker labels. Right: Recorder transcript with speaker labels.|
Our speaker diarization system leverages several highly optimized machine learning models and algorithms to diarize hours of audio in a real-time streaming fashion with limited computational resources on mobile devices. The system mainly consists of three components: a speaker turn detection model that detects a change of speaker in the input speech, a speaker encoder model that extracts voice characteristics from each speaker turn, and a multi-stage clustering algorithm that assigns a speaker label to each speaker turn in a highly efficient way. All components run fully on the device.
|Architecture of the Turn-to-Diarize system.|
Detecting Speaker Turns
The first component of our system is a speaker turn detection model based on a Transformer Transducer (T-T), which converts the acoustic features into text transcripts augmented with a special token <st> representing a speaker turn. Unlike previous customized systems that use role-specific tokens (e.g., <patient>) for conversations, this model is more generic and can be trained on and deployed to various application domains.
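To make the role of the turn token concrete, here is a minimal sketch of how a token stream containing <st> could be split into speaker turns. Only the token string comes from the description above; the function name `split_into_turns` and the plain list-of-strings input are assumptions made for illustration, and a real system would also carry timing information for each token.

```python
SPEAKER_TURN = "<st>"  # special token emitted at a change of speaker


def split_into_turns(tokens):
    """Group word tokens into speaker turns, using <st> as the boundary.

    `tokens` is assumed to be a flat list of strings (hypothetical format).
    Empty turns, such as those caused by consecutive <st> tokens, are skipped.
    """
    turns = []
    current = []
    for token in tokens:
        if token == SPEAKER_TURN:
            if current:
                turns.append(current)
            current = []
        else:
            current.append(token)
    if current:
        turns.append(current)
    return turns


transcript = "how are you <st> fine thanks <st> good".split()
for index, turn in enumerate(split_into_turns(transcript), start=1):
    print(f"turn {index}: {' '.join(turn)}")
```

Note that the turns are only boundaries at this point: deciding which turns belong to the same speaker is left to the later stages.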
In most applications, the output of a diarization system is not shown to users directly, but is combined with a separate automatic speech recognition (ASR) system that is trained to make fewer word errors. Therefore, for the diarization system, we are relatively more tolerant of word token errors than of errors of the <st> token. Based on this intuition, we propose a new token-level loss function that allows us to train a small speaker turn detection model with high accuracy on predicted <st> tokens. Combined with edit-based minimum Bayes risk (EMBR) training, this new loss function significantly improved the interval-based F1 score on seven evaluation datasets.
Extracting Voice Characteristics
Once the audio recording has been segmented into homogeneous speaker turns, we use a speaker encoder model to extract an embedding vector (i.e., d-vector) to represent the voice characteristics of each speaker turn. This approach has several advantages over prior work that extracts embedding vectors from small fixed-length segments. First, it avoids extracting an embedding from a segment containing speech from multiple speakers. At the same time, each embedding covers a relatively large time range that contains sufficient signals from the speaker. It also reduces the total number of embeddings to be clustered, thus making the clustering step less expensive. These embeddings are processed entirely on-device until speaker labeling of the transcript is completed, and then deleted.
After the audio recording is represented by a sequence of embedding vectors, the last step is to cluster these embedding vectors and assign a speaker label to each. However, since audio recordings from the Recorder app can be as short as a few seconds or as long as 18 hours, it is critical for the clustering algorithm to handle sequences of drastically different lengths.
For this we propose a multi-stage clustering strategy to leverage the benefits of different clustering algorithms. First, we use the speaker turn detection outputs to determine whether there are at least two different speakers in the recording. For short sequences, we use agglomerative hierarchical clustering (AHC) as the fallback algorithm. For medium-length sequences, we use spectral clustering as our main algorithm, and use the eigen-gap criterion for accurate speaker count estimation. For long sequences, we reduce computational cost by using AHC to pre-cluster the sequence before feeding it to the main algorithm. During streaming, we keep a dynamic cache of previous AHC cluster centroids that can be reused for future clustering calls. This mechanism allows us to enforce an upper bound on the entire system with constant time and space complexity.
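As an illustration of the eigen-gap idea, the sketch below picks the speaker count at the largest drop between consecutive eigenvalues. It assumes the eigenvalues of the affinity matrix have already been computed elsewhere, and it uses the difference between neighboring eigenvalues; the description above does not say which variant of the criterion (difference or ratio) is used, so treat both the function name `estimate_speaker_count` and this choice as assumptions.

```python
def estimate_speaker_count(eigenvalues, max_speakers=None):
    """Estimate the number of speakers from the largest eigen-gap.

    `eigenvalues` are assumed to come from the affinity matrix of the
    speaker-turn embeddings. The gap after the k-th largest eigenvalue is
    values[k - 1] - values[k]; the k with the largest gap is returned.
    """
    values = sorted(eigenvalues, reverse=True)
    if len(values) < 2:
        return len(values)
    limit = len(values) - 1
    if max_speakers is not None:
        # Hypothetical cap on the speaker count; never search fewer than 1.
        limit = max(1, min(limit, max_speakers))
    gaps = [values[k - 1] - values[k] for k in range(1, limit + 1)]
    return gaps.index(max(gaps)) + 1


# Three large eigenvalues followed by a sharp drop suggest three speakers.
print(estimate_speaker_count([1.0, 0.95, 0.9, 0.2, 0.1]))
```

The intuition is that each well-separated speaker contributes one large eigenvalue, so the position of the sharpest drop is a reasonable guess for the number of clusters.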
This multi-stage clustering strategy is a critical optimization for on-device applications where the budget for CPU, memory, and battery is very small, and it allows the system to run in a low power mode even after diarizing hours of audio. As a tradeoff between quality and efficiency, the upper bound of the computational cost can be flexibly configured for devices with different computational resources.
|Diagram of the multi-stage clustering strategy.|
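The stage selection and the pre-clustering step can be sketched roughly as follows. This is a simplified illustration, not the production implementation: the thresholds in `ClusteringConfig`, the stage names, and the centroid-linkage cosine merge are all assumptions, the embeddings are assumed to be non-zero vectors, and the streaming centroid cache is omitted.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusteringConfig:
    # Hypothetical thresholds; tune per device to trade quality for cost.
    short_max: int = 50   # at or below this, use the AHC fallback
    main_max: int = 300   # above this, pre-cluster with AHC first


def choose_stage(num_embeddings, multiple_speakers, config=ClusteringConfig()):
    """Pick a clustering stage from the sequence length (assumed rule)."""
    if not multiple_speakers:
        return "single_speaker"
    if num_embeddings <= config.short_max:
        return "ahc_fallback"
    if num_embeddings <= config.main_max:
        return "spectral"
    return "ahc_precluster_then_spectral"


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ahc_precluster(embeddings, max_clusters):
    """Merge the most similar clusters until at most `max_clusters` remain.

    Returns the cluster centroids, which bound the input size of the main
    algorithm. Naive O(n^3) loop, kept simple for readability.
    """
    # Each cluster is stored as (sum of member vectors, member count).
    clusters = [(list(e), 1) for e in embeddings]
    while len(clusters) > max(1, max_clusters):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = cosine(clusters[i][0], clusters[j][0])
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        merged = ([x + y for x, y in zip(clusters[i][0], clusters[j][0])],
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [[x / count for x in total] for total, count in clusters]


print(choose_stage(1000, multiple_speakers=True))
centroids = ahc_precluster([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]], 2)
print(len(centroids))
```

Because the main algorithm only ever sees at most `max_clusters` centroids, its cost no longer grows with the length of the recording, which is the property the upper bound relies on.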
Correction and Customization
In our real-time streaming speaker diarization system, as the model consumes more audio input, it accumulates confidence on predicted speaker labels, and may occasionally make corrections to previously predicted low-confidence speaker labels. The Recorder app automatically updates the speaker labels on the screen during recording to reflect the latest and most accurate predictions.
At the same time, the Recorder app’s UI allows the user to rename the anonymous speaker labels (e.g., “Speaker 2”) to customized labels (e.g., “car dealer”) for better readability and easier memorization within each recording.
|Recorder allows the user to rename the speaker labels for better readability.|
Currently, our diarization system mostly runs on the CPU block of Google Tensor, Google’s custom-built chip that powers more recent Pixel phones. We are working on delegating more computations to the TPU block, which will further reduce the overall power consumption of the diarization system. Another future work direction is to leverage the multilingual capabilities of speaker encoder and speech recognition models to expand this feature to more languages.
The work described in this post represents joint efforts from multiple teams within Google. Contributors include Quan Wang, Yiling Huang, Evan Clark, Qi Cao, Han Lu, Guanlong Zhao, Wei Xia, Hasim Sak, Alvin Zhou, Jason Pelecanos, Luiza Timariu, Allen Su, Fan Zhang, Hugh Love, Kristi Bradford, Vincent Peng, Raff Tsai, Richard Chou, Yitong Lin, Ann Lu, Kelly Tsai, Hannah Bowman, Tracy Wu, Taral Joglekar, Dharmesh Mokani, Ajay Dudani, Ignacio Lopez Moreno, Diego Melendo Casado, Nino Tasca, Alex Gruenstein.