Google AI Blog: Alpa: Automated Model-Parallel Deep Learning


Update — 2022/05/06: This post has been updated to provide more details about the collaboration with external research partners in the development of Alpa.

Over the last several years, the rapidly growing size of deep learning models has quickly exceeded the memory capacity of single accelerators. Earlier models like BERT (with a parameter size of < 1GB) can efficiently scale across accelerators by leveraging data parallelism, in which model weights are duplicated across accelerators while only the training data is partitioned and distributed. However, recent large models like GPT-3 (with a parameter size of 175GB) can only scale using model parallel training, where a single model is partitioned across different devices.

While model parallelism techniques make it possible to train large models, they are more complex in that they need to be specifically designed for target neural networks and compute clusters. For example, Megatron-LM uses a model parallelism strategy to split the weight matrices by rows or columns and then synchronizes the results among devices. Device placement or pipeline parallelism partitions different operators in a neural network into multiple groups and the input data into micro-batches that are executed in a pipelined fashion. Model parallelism often requires significant effort from system experts to identify an optimal parallelism plan for a specific model. But doing so is too onerous for most machine learning (ML) researchers, whose primary focus is to run a model and for whom the model's performance is a secondary priority. As such, there remains an opportunity to automate model parallelism so that it can easily be applied to large models.

In “Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning”, published at OSDI 2022, we describe a method for automating the complex process of parallelizing a model, which was developed in close collaboration with researchers at Amazon Web Services, UC Berkeley, Shanghai Jiao Tong University, Duke University and Carnegie Mellon University. We demonstrate that with only one line of code Alpa can transform any JAX neural network into a distributed version with an optimal parallelization strategy that can be executed on a user-provided device cluster. We are also excited to announce the open-source release of Alpa's code to the broader research community.
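To make the "one line of code" claim concrete, here is a minimal sketch of the usage pattern: decorate a training-step function and call it as usual. The `parallelize` decorator below is a no-op stand-in for Alpa's real decorator (the actual package traces the function, runs the compiler passes described later, and returns a distributed executable), and the training step is a toy, so this only illustrates the shape of the API, not Alpa's implementation.

```python
def parallelize(fn):
    # Stand-in for Alpa's decorator: a real implementation would trace fn,
    # run the inter-/intra-operator passes, and return a distributed
    # executable for the device cluster. Here we just return fn unchanged.
    return fn

@parallelize  # the single added line
def train_step(weights, batch):
    # Toy "update rule" standing in for a real gradient step.
    return [w - 0.1 * x for w, x in zip(weights, batch)]

print(train_step([1.0, 2.0], [10.0, 10.0]))  # [0.0, 1.0]
```

The point of the design is that the body of `train_step` is written exactly as for single-device JAX; all distribution decisions live in the decorator.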

Alpa Design

We begin by grouping existing ML parallelization strategies into two categories, inter-operator parallelism and intra-operator parallelism. Inter-operator parallelism assigns distinct operators to different devices (e.g., device placement), often accelerated with a pipeline execution schedule (e.g., pipeline parallelism). With intra-operator parallelism, which includes data parallelism (e.g., Deepspeed-Zero), operator parallelism (e.g., Megatron-LM), and expert parallelism (e.g., GShard-MoE), individual operators are split and executed on multiple devices, and collective communication is often used to synchronize the results across devices.

The difference between these two approaches maps naturally to the heterogeneity of a typical compute cluster. Inter-operator parallelism has lower communication bandwidth requirements because it only transmits activations between operators on different accelerators. But it suffers from device underutilization because of its pipeline data dependency, i.e., some operators are inactive while waiting on the outputs from other operators. In contrast, intra-operator parallelism doesn't have the data dependency issue, but requires heavier communication across devices. In a GPU cluster, the GPUs within a node have higher communication bandwidth that can accommodate intra-operator parallelism. However, GPUs across different nodes are often connected with much lower bandwidth (e.g., ethernet), so inter-operator parallelism is preferred there.

Leveraging this heterogeneous mapping, we design Alpa as a compiler that conducts various passes when given a computational graph and a device cluster from a user. First, the inter-operator pass slices the computational graph into subgraphs and the device cluster into submeshes (i.e., partitions of the device cluster) and identifies the best way to assign a subgraph to a submesh. Then, the intra-operator pass finds the best intra-operator parallelism plan for each pipeline stage from the inter-operator pass. Finally, the runtime orchestration pass generates a static plan that orders the computation and communication and executes the distributed computational graph on the actual device cluster.

An overview of Alpa. In the sliced subgraphs, red and blue represent the way the operators are partitioned and gray represents operators that are replicated. Green represents the actual devices (e.g., GPUs).

Intra-Operator Pass

Similar to previous research (e.g., Mesh-TensorFlow and GSPMD), intra-operator parallelism partitions a tensor on a device mesh. This is shown below for a typical 3D tensor in a Transformer model with batch, sequence, and hidden dimensions. The batch dimension is partitioned along device mesh dimension 0 (mesh0), the hidden dimension is partitioned along mesh dimension 1 (mesh1), and the sequence dimension is replicated on each processor.

A 3D tensor that is partitioned on a 2D device mesh.
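The per-device shard shape under such a partitioning is easy to compute. The sketch below shows the mapping just described, a (batch, sequence, hidden) tensor on a 2×2 mesh with batch split along mesh dimension 0, hidden along mesh dimension 1, and sequence replicated; the mesh and tensor sizes are illustrative, not taken from the paper.

```python
def shard_shape(tensor_shape, mesh_shape, dim_to_mesh):
    """Per-device shard shape for a tensor placed on a device mesh.

    dim_to_mesh maps a tensor dimension index to a mesh dimension index;
    tensor dimensions absent from the map are replicated on every device.
    """
    local = list(tensor_shape)
    for dim, mesh_dim in dim_to_mesh.items():
        assert local[dim] % mesh_shape[mesh_dim] == 0, "uneven partition"
        local[dim] //= mesh_shape[mesh_dim]
    return tuple(local)

# (batch=8, sequence=128, hidden=64) on a 2x2 mesh:
# batch -> mesh0, hidden -> mesh1, sequence replicated.
print(shard_shape((8, 128, 64), (2, 2), {0: 0, 2: 1}))  # (4, 128, 32)
```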

With these tensor partitions, Alpa further defines a set of parallelization strategies for each individual operator in a computational graph. We show example parallelization strategies for matrix multiplication in the figure below. Defining parallelization strategies on operators leads to possible conflicts on the partitions of tensors, because one tensor can be both the output of one operator and the input of another. In this case, re-partitioning is required between the two operators, which incurs additional communication costs.

The parallelization strategies for matrix multiplication.
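For C[i, j] = Σₖ A[i, k]·B[k, j] on a 1D mesh of n devices, the standard strategy space is to split the i, j, or k dimension; splitting the contraction dimension k leaves partial sums on each device and forces an all-reduce. The sketch below enumerates these options with a deliberately crude communication estimate; it is illustrative and not Alpa's actual cost model.

```python
def matmul_strategies(i, j, k, n, elem_bytes=4):
    """Return (name, resulting output sharding, extra communication bytes)
    for the three standard ways to parallelize C = A @ B on n devices."""
    return [
        # Split i: shard rows of A; C comes out row-sharded, no collective.
        ("split_i", f"C sharded over i ({i // n}x{j} per device)", 0),
        # Split j: shard columns of B; C comes out column-sharded.
        ("split_j", f"C sharded over j ({i}x{j // n} per device)", 0),
        # Split k: each device holds a partial C; an all-reduce over the
        # full C (i*j elements) is needed to finish the contraction.
        ("split_k", f"C replicated after all-reduce ({i}x{j} per device)",
         i * j * elem_bytes),
    ]

for name, out, comm in matmul_strategies(1024, 1024, 4096, n=4):
    print(f"{name:8s} -> {out}, extra communication: {comm} bytes")
```

Which option is cheapest depends on the surrounding operators, which is exactly why the choice is made globally in the next step rather than per-operator.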

Given the partitions of each operator and the re-partitioning costs, we formulate the intra-operator pass as an Integer-Linear Programming (ILP) problem. For each operator, we define a one-hot variable vector to enumerate its partition strategies. The ILP objective is to minimize the sum of compute and communication cost (node cost) and re-partitioning communication cost (edge cost). The solution of the ILP translates to one specific way to partition the original computational graph.
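The objective can be seen on a toy two-operator graph. The sketch below brute-forces the same node-cost-plus-edge-cost objective that the ILP encodes (a real implementation hands the one-hot formulation to an ILP solver instead); the operator names, strategies, and cost numbers are all made up for illustration.

```python
from itertools import product

# Node cost: compute + communication cost of running each operator
# under each candidate strategy (made-up numbers).
node_cost = {
    "matmul1": {"split_i": 4.0, "split_k": 6.0},
    "matmul2": {"split_i": 5.0, "split_k": 2.0},
}
# Edge cost: re-partitioning cost on the edge matmul1 -> matmul2,
# indexed by the (producer strategy, consumer strategy) pair.
edge_cost = {
    ("split_i", "split_i"): 0.0, ("split_i", "split_k"): 1.0,
    ("split_k", "split_i"): 1.0, ("split_k", "split_k"): 0.0,
}

def solve():
    """Minimize total node cost + edge cost over all strategy choices."""
    ops = list(node_cost)
    best = None
    for choice in product(*(node_cost[op] for op in ops)):
        total = sum(node_cost[op][s] for op, s in zip(ops, choice))
        total += edge_cost[choice]  # single edge: matmul1 -> matmul2
        if best is None or total < best[0]:
            best = (total, dict(zip(ops, choice)))
    return best

print(solve())  # (7.0, {'matmul1': 'split_i', 'matmul2': 'split_k'})
```

Note that the optimum mixes strategies and pays a small re-partitioning cost, which is exactly the trade-off the edge term exists to capture.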

Inter-Operator Pass

The inter-operator pass slices the computational graph and device cluster for pipeline parallelism. As shown below, the boxes represent micro-batches of input and the pipeline stages represent a submesh executing a subgraph. The horizontal dimension represents time and shows the pipeline stage at which each micro-batch is executed. The goal of the inter-operator pass is to minimize the total execution latency, i.e., the end-to-end time to run the entire workload on the cluster, as illustrated in the figure below. Alpa uses a Dynamic Programming (DP) algorithm to minimize the total latency. The computational graph is first flattened, and then fed to the intra-operator pass, where the performance of all possible partitions of the device cluster into submeshes is profiled.

Pipeline parallelism. For a given time, this figure shows the micro-batches (colored boxes) that a partitioned device cluster and a sliced computational graph (e.g., stage 1, 2, 3) are processing.
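The objective being minimized can be sketched on a toy operator chain: with B micro-batches, a pipeline's latency is the sum of the stage latencies plus (B − 1) times the slowest stage. The code below brute-forces all contiguous slicings for clarity; Alpa's actual DP is more involved (it also assigns a submesh to each stage and uses the profiled intra-operator costs), and the latencies here are invented.

```python
from itertools import combinations

def pipeline_latency(stage_times, num_microbatches):
    # Fill + drain time of the pipeline: every stage runs once, then the
    # remaining (B - 1) micro-batches are bottlenecked by the slowest stage.
    return sum(stage_times) + (num_microbatches - 1) * max(stage_times)

def best_slicing(op_times, num_stages, num_microbatches):
    """Try every way to cut the operator chain into contiguous stages."""
    n = len(op_times)
    best = None
    for cuts in combinations(range(1, n), num_stages - 1):
        bounds = [0, *cuts, n]
        stages = [sum(op_times[a:b]) for a, b in zip(bounds, bounds[1:])]
        latency = pipeline_latency(stages, num_microbatches)
        if best is None or latency < best[0]:
            best = (latency, stages)
    return best

# Eight operators, sliced into three stages, pipelined over 4 micro-batches.
print(best_slicing([2, 3, 1, 4, 2, 2, 3, 1], num_stages=3,
                   num_microbatches=4))  # (36, [6, 6, 6])
```

The optimum balances the stages, which matches the intuition that the slowest stage dominates once the pipeline is full.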

Runtime Orchestration

After the inter- and intra-operator parallelization plans are complete, the runtime generates and dispatches a static sequence of execution instructions for each device submesh. These instructions include RUN a specific subgraph, SEND/RECEIVE tensors from other meshes, or DELETE a specific tensor to free the memory. By following the instructions, the devices can execute the computational graph without any other coordination.
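A toy interpreter makes the instruction stream concrete. Each submesh walks its own static list of RUN / SEND / RECV / DELETE instructions; a shared dict stands in for the interconnect. Only the instruction names mirror the post, everything else (the stage functions, buffer model) is an illustrative assumption.

```python
def execute(instructions, subgraphs, channel):
    """Run one submesh's static instruction list; return its live buffers."""
    buffers = {}
    for op, name in instructions:
        if op == "RUN":       # execute a compiled subgraph locally
            buffers[name] = subgraphs[name](buffers)
        elif op == "SEND":    # push a tensor toward another mesh
            channel[name] = buffers[name]
        elif op == "RECV":    # pull a tensor produced on another mesh
            buffers[name] = channel[name]
        elif op == "DELETE":  # free a dead tensor as early as possible
            del buffers[name]
    return buffers

channel = {}
mesh0 = [("RUN", "stage1"), ("SEND", "stage1"), ("DELETE", "stage1")]
mesh1 = [("RECV", "stage1"), ("RUN", "stage2")]
subgraphs = {
    "stage1": lambda bufs: 21,                  # stand-in computation
    "stage2": lambda bufs: bufs["stage1"] * 2,  # consumes stage1's output
}
print(execute(mesh0, subgraphs, channel))  # {}
print(execute(mesh1, subgraphs, channel))  # {'stage1': 21, 'stage2': 42}
```

Because the sequence is fixed ahead of time, no scheduler runs during training, which is what "without any other coordination" means above.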


Evaluation

We test Alpa on eight AWS p3.16xlarge instances, each of which has eight 16 GB V100 GPUs, for 64 total GPUs. We examine weak scaling results, growing the model size while increasing the number of GPUs. We evaluate three models: (1) the standard Transformer model (GPT); (2) the GShard-MoE model, a transformer with mixture-of-expert layers; and (3) Wide-ResNet, a significantly different model with no existing expert-designed model parallelization strategy. Performance is measured in peta-floating point operations per second (PFLOPS) achieved on the cluster.

We demonstrate that for GPT, Alpa outputs a parallelization strategy very similar to the one computed by the best existing framework, Megatron-LM, and matches its performance. For GShard-MoE, Alpa outperforms the best expert-designed baseline on GPU (i.e., Deepspeed) by up to 8x. Results for Wide-ResNet show that Alpa can generate an optimal parallelization strategy for models that have not been studied by experts. We also show the linear-scaling numbers for reference.

GPT: Alpa matches the performance of Megatron-LM, the best expert-designed framework.
GShard MoE: Alpa outperforms Deepspeed (the best expert-designed framework on GPU) by up to 8x.
Wide-ResNet: Alpa generalizes to models without manual plans. Pipeline and Data Parallelism (PP-DP) is a baseline that uses only pipeline and data parallelism and no other intra-operator parallelism.
The parallelization strategy for Wide-ResNet on 16 GPUs consists of three pipeline stages and would be complicated even for an expert to design. Stages 1 and 2 run on 4 GPUs performing data parallelism, and stage 3 runs on 8 GPUs performing operator parallelism.


Conclusion

Designing an effective parallelization plan for distributed model-parallel deep learning has historically been a difficult and labor-intensive task. Alpa is a new framework that leverages intra- and inter-operator parallelism for automated model-parallel distributed training. We believe that Alpa will democratize model-parallel distributed learning and accelerate the development of large deep learning models. Explore the open-source code and learn more about Alpa in our paper.


Acknowledgements

Thanks to our colleagues at AWS, UC Berkeley, Shanghai Jiao Tong University, Duke University and Carnegie Mellon University for their partnership in this research. Thanks to the co-authors of the paper: Lianmin Zheng, Hao Zhang, Yonghao Zhuang, Yida Wang, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, and Ion Stoica. Lianmin (UC Berkeley & AWS) led the effort on the intra-operator pass, Zhuohan (UC Berkeley & Google) led the effort on the inter-operator pass, and Hao (UC Berkeley) led the effort on the runtime orchestration pass. This project would not have been possible without the internship opportunities provided by AWS and Google for Lianmin and Zhuohan, respectively. We would also like to thank Shibo Wang, Jinliang Wei, Yanping Huang, Yuanzhong Xu, Zhifeng Chen, Claire Cui, Naveen Kumar, Yash Katariya, Laurent El Shafey, Qiao Zhang, Yonghui Wu, Marcello Maggioni, Mingyao Yang, Michael Isard, Skye Wanderman-Milne, and David Majnemer for their contributions to this research.

