Efficient Sequence Modeling for On-Device ML


The growing demand for machine learning (ML) model inference on-device (for mobile devices, tablets, etc.) is driven by the rise of compute-intensive applications, the need to keep certain data on device for privacy and security reasons, and the desire to provide services when a network connection may not be available. However, on-device inference introduces a myriad of challenges, ranging from modeling to platform support requirements. These challenges relate to how different architectures are designed to optimize memory and computation while still maintaining the quality of the model. From a platform perspective, the challenge is identifying operations and building on top of them in a way that generalizes well across different product use cases.

In previous research, we combined a novel technique for generating embeddings (called projection-based embeddings) with efficient architectures like QRNN (pQRNN) and showed them to be competent on a number of classification problems. Augmenting these with distillation techniques provides an additional bump in end-to-end quality. Although this is an effective approach, it does not scale to bigger and more extensive vocabularies (i.e., all possible Unicode or word tokens that can be fed to the model). Additionally, the output from the projection operation itself doesn’t contain trainable weights, so it cannot take advantage of pre-training the model.

Token-free models presented in ByT5 are a good starting point for on-device modeling that can address pre-training and scalability issues without the need to increase the size of the model. This is possible because these approaches treat text inputs as a stream of bytes (each byte has a value that ranges from 0 to 255), which reduces the vocabulary size for the embedding tables from ~30,000 to 256. Although ByT5 presents a compelling alternative for on-device modeling, going from a word-level representation to a byte stream representation increases the sequence lengths linearly; with an average word length of four characters and a single character having up to four bytes, the byte sequence length increases proportionally to the word length. This can lead to a significant increase in inference latency and computational costs.
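To make the sequence-length trade-off concrete, here is a minimal sketch in plain Python. It is not taken from ByT5 or SeqFlowLite; the helper names `byte_ids` and `bytes_per_word` are made up for illustration. It turns a string into its UTF-8 byte values and compares the byte count with the word count.

```python
from typing import List


def byte_ids(text: str) -> List[int]:
    """Return the UTF-8 bytes of `text` as integers in the range 0..255."""
    return list(text.encode("utf-8"))


def bytes_per_word(text: str) -> float:
    """Bytes per whitespace-separated word: a rough proxy for how much longer
    a byte-level sequence is than a word-level one."""
    words = text.split()
    return len(byte_ids(text)) / len(words)


sample = "on-device models read bytes"
ids = byte_ids(sample)
print(len(sample.split()), "words ->", len(ids), "byte ids")
print("largest id:", max(ids), "| bytes per word:", bytes_per_word(sample))

# Characters outside ASCII need more than one byte each in UTF-8.
for ch in ["a", "é", "日", "😀"]:
    print(repr(ch), "->", len(ch.encode("utf-8")), "byte(s)")
```

Every id fits in a 256-entry embedding table, but the model now has to process several positions per word, which is the cost the rest of the post sets out to reduce.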

We address this problem by developing and releasing three novel byte-stream sequence models for the SeqFlowLite library (ByteQRNN, ByteTransformer and ByteFunnelTransformer), all of which can be pre-trained on unsupervised data and fine-tuned for specific tasks. These models leverage recent innovations introduced by Charformer, including a fast character Transformer-based model that uses a gradient-based subword tokenization (GBST) approach to operate directly at the byte level, as well as a “soft” tokenization approach, which allows us to learn token boundaries and reduce sequence lengths. In this post, we focus on ByteQRNN and demonstrate that the performance of a pre-trained ByteQRNN model is comparable to BERT, despite being 300x smaller.

Sequence Model Architecture

We leverage pQRNN, ByT5 and Charformer along with platform optimizations, such as in-training quantization (which tracks minimum and maximum float values for model activations and weights for quantizing the inference model) that reduces model sizes by one-fourth, to develop an end-to-end model called ByteQRNN (shown below). First, we use a ByteSplitter operation to split the input string into a byte stream and feed it to a smaller embedding table that has a vocabulary size of 259 (256 + 3 additional meta tokens).

The output from the embedding layer is fed to the GBST layer, which is equipped with in-training quantization and combines byte-level representations with the efficiency of subword tokenization while enabling end-to-end learning of latent subwords. We “soft” tokenize the byte stream sequences by enumerating and combining each subword block length with scores (computed with a quantized dense layer) at each strided token position (i.e., at token positions that are selected at regular intervals). Next, we downsample the byte stream to a manageable sequence length and feed it to the encoder layer.
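The “soft” tokenization step can be sketched in plain Python as follows. This is a simplified illustration of the idea as described in the paragraph above and in Charformer, not the SeqFlowLite implementation: it scores every position rather than only strided ones, uses a single weight vector `score_weights` in place of the quantized dense layer, and omits quantization. All function and parameter names are assumptions made for this example.

```python
import math
from typing import List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def block_candidates(seq: List[Vector], block: int) -> List[Vector]:
    """Mean-pool non-overlapping blocks of `block` bytes, then repeat each
    pooled vector so there is one candidate per original byte position."""
    out: List[Vector] = []
    for start in range(0, len(seq), block):
        chunk = seq[start:start + block]
        out.extend([mean(chunk)] * len(chunk))
    return out


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def soft_tokenize(seq: List[Vector], score_weights: Vector,
                  max_block: int, rate: int) -> List[Vector]:
    """Mix block sizes 1..max_block per position, then downsample by `rate`."""
    dim = len(seq[0])
    candidates = [block_candidates(seq, b) for b in range(1, max_block + 1)]
    mixed: List[Vector] = []
    for pos in range(len(seq)):
        options = [c[pos] for c in candidates]
        # One scalar score per block size (stand-in for the dense layer).
        scores = [sum(w * x for w, x in zip(score_weights, v)) for v in options]
        probs = softmax(scores)
        mixed.append([sum(p * v[d] for p, v in zip(probs, options))
                      for d in range(dim)])
    # Mean-pool groups of `rate` positions to shorten the sequence.
    return [mean(mixed[i:i + rate]) for i in range(0, len(mixed), rate)]


byte_embeddings = [[float(i), float(i % 2)] for i in range(8)]
latent = soft_tokenize(byte_embeddings, [0.1, -0.2], max_block=3, rate=4)
print(len(byte_embeddings), "byte positions ->", len(latent), "latent subwords")
```

Because the mixing weights come from a softmax, the block-size choice stays differentiable, which is what allows token boundaries to be learned end to end.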

The output from the GBST layer can be downsampled to a lower sequence length for efficient encoder computation or can be used by an encoder, like Funnel Transformer, which pools the query length and reduces the self-attention computation, to create the ByteFunnelTransformer model. The encoder in the end-to-end model can be replaced with any other encoder layer, such as the Transformer from the SeqFlowLite library, to create a ByteTransformer model.

A diagram of a generic end-to-end sequence model using byte stream input. The ByteQRNN model uses a QRNN encoder from the SeqFlowLite library.
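The min/max tracking behind in-training quantization, mentioned earlier in this section, can be illustrated with a small sketch. It assumes a standard affine mapping of floats onto the 8-bit range 0..255; the class `RangeTracker` and the helper functions are hypothetical names for this example and do not come from SeqFlowLite.

```python
from typing import List, Tuple


class RangeTracker:
    """Tracks the running minimum and maximum of the float values it observes."""

    def __init__(self) -> None:
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, values: List[float]) -> None:
        self.lo = min(self.lo, min(values))
        self.hi = max(self.hi, max(values))

    def params(self) -> Tuple[float, int]:
        """Return (scale, zero_point) for an affine mapping onto 0..255."""
        # Widen the range to include 0.0 so that zero is exactly representable.
        lo, hi = min(self.lo, 0.0), max(self.hi, 0.0)
        scale = (hi - lo) / 255.0 or 1.0  # avoid a zero scale for all-zero data
        zero_point = round(-lo / scale)
        return scale, zero_point


def quantize(values: List[float], scale: float, zero_point: int) -> List[int]:
    """Map floats to integers, clamping to the 8-bit range."""
    return [max(0, min(255, round(v / scale) + zero_point)) for v in values]


def dequantize(q: List[int], scale: float, zero_point: int) -> List[float]:
    """Map 8-bit integers back to approximate float values."""
    return [(x - zero_point) * scale for x in q]


tracker = RangeTracker()
for batch in ([-1.0, 0.5], [0.25, 2.0]):  # values seen over several steps
    tracker.observe(batch)

scale, zero_point = tracker.params()
weights = [-1.0, 0.0, 0.5, 2.0]
q = quantize(weights, scale, zero_point)
print("quantized:", q)
print("restored: ", dequantize(q, scale, zero_point))
```

The restored values differ from the originals by at most about one quantization step, which is the precision given up in exchange for 8-bit storage.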

In addition to the input embeddings (i.e., the output from the embedding layer described above), we go a step further to build an effective sequence-to-sequence (seq2seq) model. We do so by taking ByteQRNN and adding a Transformer-based decoder model along with a quantized beam search (or tree exploration) to go with it. The quantized beam search module reduces the inference latency when generating decoder outputs by computing the most likely beams (i.e., possible output sequences) using the logarithmic sum of previous and current probabilities and returns the resulting top beams. Here the system uses a more efficient 8-bit integer (uint8) format, compared to a typical single-precision floating-point format (float32) model.
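The scoring rule described above, adding the log-probability of each new token to the running score of its beam and keeping the top candidates, can be sketched as follows. This is a plain floating-point illustration with a hand-written toy model (`toy_model`, an assumption for this example); it does not reproduce the quantized uint8 arithmetic or the API of the SeqFlowLite beam search module.

```python
import math
from typing import Callable, Dict, List, Tuple

Beam = Tuple[float, List[str]]  # (cumulative log-probability, tokens so far)


def beam_search(next_log_probs: Callable[[List[str]], Dict[str, float]],
                beam_width: int, max_steps: int, eos: str = "</s>") -> List[Beam]:
    """Return up to `beam_width` sequences ranked by summed log-probability."""
    beams: List[Beam] = [(0.0, [])]
    for _ in range(max_steps):
        candidates: List[Beam] = []
        for score, tokens in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((score, tokens))  # finished beams carry over
                continue
            for tok, log_p in next_log_probs(tokens).items():
                # Sum of previous and current log-probabilities.
                candidates.append((score + log_p, tokens + [tok]))
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:beam_width]
        if all(t and t[-1] == eos for _, t in beams):
            break
    return beams


def toy_model(prefix: List[str]) -> Dict[str, float]:
    """Hypothetical next-token distribution, used only to exercise the search."""
    if not prefix:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix[-1] == "a":
        return {"a": math.log(0.3), "</s>": math.log(0.7)}
    return {"b": math.log(0.1), "</s>": math.log(0.9)}


for score, tokens in beam_search(toy_model, beam_width=2, max_steps=3):
    print(tokens, "probability:", math.exp(score))
```

Working in log space turns products of probabilities into sums, which keeps scores numerically stable over long sequences and is also friendlier to low-precision integer arithmetic.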

The decoder Transformer model uses a merged attention sublayer (MAtt) to reduce the complexity of the decoder self-attention from quadratic to linear, thereby lowering the end-to-end latency. For each decoding step, MAtt uses a fixed-size cache for decoder self-attention, compared to the increasing cache size of a traditional Transformer decoder. The following figure illustrates how the beam search module interacts with the decoder layer to generate output tokens on-device using an edge device (e.g., mobile phones, tablets, etc.).

A comparison of cloud server decoding and on-device (edge device) implementation. Left: Cloud server beam search employs a Transformer-based decoder model with quadratic time self-attention in float32, which has an increasing cache size for each decoding step. Right: The edge device implementation employs a quantized beam search module along with a fixed-size cache and a linear time self-attention computation.
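The difference between a growing cache and a fixed-size cache can be illustrated with a deliberately simple stand-in. The sketch below assumes the decoder only needs a uniform average over all past positions, which can be maintained as a running sum. The post does not spell out MAtt's internals, so this should be read as an illustration of why a fixed-size state gives constant work per step, not as the MAtt computation itself. Both class names are hypothetical.

```python
from typing import List

Vector = List[float]


class GrowingCache:
    """Stores every past vector; memory and per-step work grow with length."""

    def __init__(self) -> None:
        self.history: List[Vector] = []

    def step(self, vec: Vector) -> Vector:
        self.history.append(vec)
        n = len(self.history)
        # Re-reads the whole history at every decoding step.
        return [sum(col) / n for col in zip(*self.history)]


class FixedSizeCache:
    """Stores only a running sum and a count; memory is constant."""

    def __init__(self, dim: int) -> None:
        self.total: Vector = [0.0] * dim
        self.count = 0

    def step(self, vec: Vector) -> Vector:
        self.count += 1
        self.total = [t + v for t, v in zip(self.total, vec)]
        return [t / self.count for t in self.total]


growing, fixed = GrowingCache(), FixedSizeCache(dim=2)
for t in range(1, 6):
    vec = [float(t), float(2 * t)]
    a, b = growing.step(vec), fixed.step(vec)
    print("step", t, "| growing cache holds", len(growing.history),
          "vectors | fixed cache holds", len(fixed.total), "numbers |", a, b)
```

Over a sequence of n steps the first variant touches on the order of n² cached values in total, while the second touches on the order of n, which mirrors the quadratic-versus-linear contrast drawn in the figure.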


After developing ByteQRNN, we evaluate its performance on the civil_comments dataset using the area under the curve (AUC) metric and compare it to a pre-trained ByteQRNN and BERT (shown below). We demonstrate that the fine-tuned ByteQRNN improves the overall quality and brings its performance closer to the BERT models, despite being 300x smaller. Since SeqFlowLite models support in-training quantization that reduces model sizes by one-fourth, the resulting models scale well to low-compute devices. We chose multilingual data sources related to the task for pre-training both BERT and byte stream models to achieve the best possible performance.

Comparison of ByteQRNN with fine-tuned ByteQRNN and BERT on the civil_comments dataset.
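For readers unfamiliar with the metric, the sketch below computes AUC on a handful of made-up labels and scores. It assumes the metric is the area under the ROC curve, computed here through its pairwise-ranking interpretation; the numbers are illustrative only and are not results from the evaluation above.

```python
from typing import List


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """Probability that a random positive example is scored above a random
    negative one, counting ties as half. This equals the area under the ROC
    curve."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 1 = toxic comment, 0 = non-toxic, with hypothetical model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print("AUC:", roc_auc(labels, scores))
```

An AUC of 1.0 means every positive example outranks every negative one, while 0.5 corresponds to random ordering.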


Following up on our previous work with pQRNN, we evaluate byte stream models for on-device use to enable pre-training and thereby improve model performance for on-device deployment. We present an evaluation for ByteQRNN with and without pre-training and demonstrate that the performance of the pre-trained ByteQRNN is comparable to BERT, despite being 300x smaller. In addition to ByteQRNN, we are also releasing ByteTransformer and ByteFunnelTransformer, two models that use different encoders, along with the merged attention decoder model and the beam search driver to run inference through the SeqFlowLite library. We hope these models will provide researchers and product developers with valuable resources for future on-device deployments.


We would like to thank Khoa Trinh, Jeongwoo Ko, Peter Young and Yicheng Fan for helping with open-sourcing and evaluating the model. Thanks to Prabhu Kaliamoorthi for all the brainstorming and ideation. Thanks to Vinh Tran, Jai Gupta and Yi Tay for their help with pre-training byte stream models. Thanks to Ruoxin Sang, Haoyu Zhang, Ce Zheng, Chuanhao Zhuge and Jieying Luo for helping with the TPU training. Many thanks to Erik Vee, Ravi Kumar and the Learn2Compress leadership for sponsoring the project and for their support and encouragement. Finally, we would like to thank Tom Small for the animated figure used in this post.

