Unlocking the Full Potential of Datacenter ML Accelerators with Platform-Aware Neural Architecture Search

Continuing advances in the design and implementation of datacenter (DC) accelerators for machine learning (ML), such as TPUs and GPUs, have been critical for powering modern ML models and applications at scale. These improved accelerators exhibit peak performance (e.g., FLOPs) that is orders of magnitude better than that of traditional computing systems. However, there is a fast-widening gap between the available peak performance offered by state-of-the-art hardware and the actual achieved performance when ML models run on that hardware.

One approach to addressing this gap is to design hardware-specific ML models that optimize both performance (e.g., throughput and latency) and model quality. Recent applications of neural architecture search (NAS), an emerging paradigm for automating the design of ML model architectures, have employed a platform-aware multi-objective approach that includes a hardware performance objective. While this approach has yielded improved model performance in practice, the details of the underlying hardware architecture remain opaque to the model. As a result, there is untapped potential to build fully hardware-friendly ML model architectures, with hardware-specific optimizations, for powerful DC ML accelerators.
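To make the multi-objective idea concrete, here is a minimal sketch of a latency-aware NAS reward in the spirit of prior platform-aware NAS work: candidate accuracy is traded against measured hardware latency relative to a target. The function name, soft-exponent weighting, and example numbers are illustrative assumptions, not the exact objective used in our search.

```python
# Minimal sketch of a latency-aware multi-objective NAS reward.
# The soft-exponent form follows prior platform-aware NAS work;
# the function name and constants are hypothetical placeholders.

def nas_reward(accuracy: float, latency_ms: float,
               target_ms: float, weight: float = -0.07) -> float:
    """Trade model quality against measured hardware latency.

    Returns accuracy scaled by (latency / target) ** weight, so candidates
    slower than the target are penalized and faster ones mildly rewarded.
    """
    return accuracy * (latency_ms / target_ms) ** weight


# Example: two candidates with similar accuracy but different measured latency.
print(nas_reward(accuracy=0.80, latency_ms=12.0, target_ms=10.0))  # over budget, penalized
print(nas_reward(accuracy=0.79, latency_ms=8.0, target_ms=10.0))   # under budget, rewarded
```

Using measured latency (rather than a FLOPs count) in such a reward is what allows the search to favor architectures that actually run fast on the target accelerator.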

In “Searching for Fast Model Families on Datacenter Accelerators”, published at CVPR 2021, we advanced the state of the art of hardware-aware NAS by automatically adapting model architectures to the hardware on which they will be executed. The approach we propose finds optimized families of models for which additional hardware performance gains cannot be achieved without sacrificing model quality (called Pareto optimization). To accomplish this, we infuse a deep understanding of hardware architecture into the design of the NAS search space for discovery of both single models and model families. We provide quantitative analysis of the performance gap between hardware and traditional model architectures and demonstrate the advantages of using true hardware performance (i.e., throughput and latency), instead of the performance proxy (FLOPs), as the performance optimization objective. Leveraging this advanced hardware-aware NAS and building upon the EfficientNet architecture, we developed a family of models, called EfficientNet-X, that demonstrate the effectiveness of this approach for Pareto-optimized ML models on TPUs and GPUs.

Platform-Aware NAS for DC ML Accelerators

To achieve high performance, ML models need to adapt to modern ML accelerators. Platform-aware NAS integrates knowledge of the hardware accelerator properties into all three pillars of NAS: (i) the search objectives; (ii) the search space; and (iii) the search algorithm (shown below). We focus on the new search space because it contains the building blocks needed to compose the models and is the key link between the ML model architectures and the accelerator hardware architectures.

We construct TPU/GPU specialized search spaces with TPU/GPU-friendly operations to infuse hardware awareness into NAS. For example, a key adaptation is maximizing parallelism to ensure that the different hardware components inside the accelerators work together as efficiently as possible. This includes the matrix multiplication units (MXUs) in TPUs and the TensorCores in GPUs for matrix/tensor computation, as well as the vector processing units (VPUs) in TPUs and CUDA cores in GPUs for vector processing. Maximizing model arithmetic intensity (i.e., optimizing the parallelism between computation and operations on the high bandwidth memory) is also critical to achieving top performance. To tap into the full potential of the hardware, it is essential for ML models to achieve high parallelism inside and across these hardware components.

Overview of platform-aware NAS on TPUs/GPUs, highlighting the search space and search objectives.
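As a rough illustration of the arithmetic-intensity argument above, the sketch below applies a simple roofline-style bound: an operation is memory-bound until its FLOPs-per-byte ratio exceeds the accelerator's ridge point. The helper names and the peak-compute and bandwidth figures are illustrative assumptions, not official TPU or GPU specifications.

```python
# Minimal sketch of the roofline reasoning behind arithmetic intensity:
# an op is memory-bound until its FLOPs-per-byte ratio exceeds the
# accelerator's peak_tflops / memory-bandwidth "ridge point".
# The peak numbers below are illustrative, not real hardware specs.

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte transferred to/from high bandwidth memory."""
    return flops / bytes_moved

def attainable_tflops(intensity: float,
                      peak_tflops: float = 100.0,
                      mem_bw_tb_per_s: float = 1.0) -> float:
    """Roofline-bounded throughput for a given arithmetic intensity."""
    return min(peak_tflops, intensity * mem_bw_tb_per_s)

# A low-intensity op (e.g., a depthwise convolution) stays memory-bound...
print(attainable_tflops(arithmetic_intensity(flops=1e9, bytes_moved=5e7)))
# ...while a high-intensity op (e.g., a large dense matmul) can approach peak.
print(attainable_tflops(arithmetic_intensity(flops=1e12, bytes_moved=4e9)))
```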

Advanced platform-aware NAS has an optimized search space containing a set of complementary techniques to holistically improve parallelism for ML model execution on TPUs and GPUs:

  1. It uses specialized tensor reshaping techniques to maximize the parallelism in the MXUs / TensorCores.
  2. It dynamically selects different activation functions depending on matrix operation types to ensure overlapping of vector and matrix/tensor processing.
  3. It employs hybrid convolutions and a novel fusion strategy to strike a balance between total compute and arithmetic intensity, ensuring that computation and memory access happen in parallel and reducing contention on the VPUs / CUDA cores.
  4. With latency-aware compound scaling (LACS), which uses hardware performance instead of FLOPs as the performance objective to search for model depth, width, and resolution, we ensure parallelism at all levels for the entire model family on the Pareto front (a minimal sketch of this idea follows this list).
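Here is a minimal sketch of the latency-aware compound scaling idea under stated assumptions: EfficientNet-style depth/width/resolution coefficients are grid-searched against a measured-latency budget instead of a FLOPs budget. The helper callables (measure_latency_ms, estimate_accuracy) and the candidate coefficient values are hypothetical placeholders, not the actual LACS search procedure.

```python
# Minimal sketch of latency-aware compound scaling (LACS): pick the
# depth/width/resolution coefficients whose compound-scaled model best
# fits a *measured latency* budget rather than a FLOPs budget.
import itertools

def scale_model(base_cfg, alpha, beta, gamma, phi=1.0):
    """EfficientNet-style compound scaling of depth, width, and resolution."""
    return {
        "depth": base_cfg["depth"] * alpha ** phi,
        "width": base_cfg["width"] * beta ** phi,
        "resolution": base_cfg["resolution"] * gamma ** phi,
    }

def search_lacs_coefficients(base_cfg, latency_budget_ms,
                             measure_latency_ms, estimate_accuracy):
    """Grid-search scaling coefficients, keeping the most accurate model
    whose measured latency fits the budget (FLOPs never enter the search)."""
    best, best_acc = None, -1.0
    for alpha, beta, gamma in itertools.product([1.1, 1.2, 1.3],
                                                [1.0, 1.1, 1.2],
                                                [1.0, 1.1, 1.15]):
        cfg = scale_model(base_cfg, alpha, beta, gamma)
        if measure_latency_ms(cfg) <= latency_budget_ms:
            acc = estimate_accuracy(cfg)
            if acc > best_acc:
                best, best_acc = (alpha, beta, gamma), acc
    return best

# Usage with stand-in callables (real ones would profile on the accelerator
# and run a short proxy training, respectively).
base = {"depth": 1.0, "width": 1.0, "resolution": 224}
coeffs = search_lacs_coefficients(
    base, latency_budget_ms=15.0,
    measure_latency_ms=lambda cfg: 5.0 * cfg["depth"] * cfg["width"],
    estimate_accuracy=lambda cfg: 0.7 + 0.01 * cfg["depth"] * cfg["width"])
print(coeffs)
```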

EfficientNet-X: Platform-Aware NAS-Optimized Computer Vision Models for TPUs and GPUs

Using this approach to platform-aware NAS, we have designed EfficientNet-X, an optimized computer vision model family for TPUs and GPUs. This family builds upon the EfficientNet architecture, which itself was originally designed by traditional multi-objective NAS without true hardware-awareness and which serves as the baseline. The resulting EfficientNet-X model family achieves an average speedup of ~1.5x–2x over EfficientNet on TPUv3 and GPUv100, respectively, with comparable accuracy.

In addition to the improved speeds, EfficientNet-X has shed light on the non-proportionality between FLOPs and true performance. Many assume that FLOPs are a good ML performance proxy (i.e., that FLOPs and performance are proportional), but they are not. While FLOPs are a good performance proxy for simple hardware such as scalar machines, they can exhibit a margin of error of up to 400% on advanced matrix/tensor machines. For example, because of its hardware-friendly model architecture, EfficientNet-X requires ~2x more FLOPs than EfficientNet, yet is ~2x faster on TPUs and GPUs.

The EfficientNet-X family achieves a 1.5x–2x speedup on average over the state-of-the-art EfficientNet family, with comparable accuracy, on TPUv3 and GPUv100.
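The FLOPs-versus-speed gap can also be illustrated with the same roofline-style reasoning used earlier: under a simple lower-bound runtime model, a network with roughly 2x the FLOPs can still finish sooner if its higher arithmetic intensity keeps the accelerator compute-bound. The numbers below are invented for illustration and are not measured EfficientNet or EfficientNet-X data.

```python
# Illustrative numbers only (not measured EfficientNet / EfficientNet-X data):
# a roofline-style runtime estimate shows why more FLOPs can still mean
# less wall-clock time when arithmetic intensity is higher.

def roofline_time_s(flops, bytes_moved, peak_flops=1e14, mem_bw=1e12):
    """Lower-bound runtime: the slower of the compute and memory phases."""
    return max(flops / peak_flops, bytes_moved / mem_bw)

low_flops_model  = roofline_time_s(flops=4e9, bytes_moved=8e8)   # memory-bound
high_flops_model = roofline_time_s(flops=8e9, bytes_moved=2e8)   # compute-bound
print(low_flops_model, high_flops_model)  # the 2x-FLOPs model finishes sooner
```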

Self-Driving ML Model Performance on New Accelerator Hardware Platforms

Platform-aware NAS exposes the inner workings of the hardware and leverages these properties when designing hardware-optimized ML models. In a sense, the "platform-awareness" of the model is a "gene" that preserves knowledge of how to optimize performance for a hardware family, even on new generations, without the need to redesign the models. For example, TPUv4i delivers up to 3x higher peak performance (FLOPS) than its predecessor, TPUv2, but EfficientNet performance only improves by 30% when migrating from TPUv2 to TPUv4i. In comparison, EfficientNet-X retains its platform-aware properties even on new hardware and achieves a 2.6x speedup when migrating from TPUv2 to TPUv4i, utilizing almost all of the 3x peak performance gain expected when upgrading between the two generations.

Hardware peak performance ratio of TPUv2 to TPUv4i and the geometric mean speedup of the EfficientNet-X and EfficientNet families, respectively, when migrating from TPUv2 to TPUv4i.

Conclusion and Future Work

We demonstrate how to improve the capabilities of platform-aware NAS for datacenter ML accelerators, especially TPUs and GPUs. Both platform-aware NAS and the EfficientNet-X model family have been deployed in production, delivering up to ~40% efficiency gains and significant quality improvements for various internal computer vision projects across Google. Additionally, because of its deep understanding of accelerator hardware architecture, platform-aware NAS was able to identify critical performance bottlenecks on TPUv2-v4i architectures and has enabled design enhancements to future TPUs with significant potential performance uplift. As next steps, we are working on expanding platform-aware NAS's capabilities to ML hardware and model design beyond computer vision.

Acknowledgements

Special thanks to our co-authors: Mingxing Tan, Ruoming Pang, Andrew Li, Liqun Cheng, Quoc Le. We also thank many collaborators, including Jeff Dean, David Patterson, Shengqi Zhu, Yun Ni, Gang Wu, Tao Chen, Xin Li, Yuan Qi, Amit Sabne, Shahab Kamali, and many others from the broader Google research and engineering teams who helped on the research and the subsequent broad production deployment of platform-aware NAS.
