Review of the 2021 Berkeley paper on parallel training of large-scale Transformer language models.
TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
https://arxiv.org/abs/2102.07988
[Paper review] TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models, Berkeley, 2021
1. Review by Seong Hoon Jung
hoondori@gmail.com
2021.03
Apache Spark
Ray
2. Motivation
The capability of Transformer-based LMs grows substantially with model size,
owing to the fact that they can be trained without supervision
on almost unlimited text data
The largest GPT-3 model has 175B parameters, which amounts
to 350 GB in 16-bit format (see the quick check after this slide)
This significantly exceeds the memory capacity of existing
hardware accelerators, such as GPUs and TPUs, which makes
model-parallel training a necessity
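As a quick arithmetic check of the 350 GB figure above, a minimal, purely illustrative Python snippet:

```python
# Back-of-the-envelope check: 175B parameters stored in 16-bit (2-byte) precision.
# This counts weights only; optimizer states, gradients, and activations add much more.
params = 175e9
bytes_per_param = 2  # fp16 / bf16
print(params * bytes_per_param / 1e9, "GB")  # -> 350.0 GB
```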
4. Related Works: Operation Partition
Megatron-LM (2019, Nvidia)
Model partition of the MLP's matrix multiplications
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, Nvidia
The weight matrix A of the first MLP layer is partitioned column-wise
The weight matrix B of the second layer is partitioned row-wise
Each partition is assigned to one accelerator (GPU or TPU)
Cross-device communication is required,
e.g. all-reduce
The conjugate operators f and g are inserted as all-reduce ops:
f performs an all-reduce in the backward pass,
g performs an all-reduce in the forward pass
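A minimal PyTorch-style sketch of this partitioned MLP, assuming a tensor-parallel process group has already been initialized and each rank holds its own weight shards. The names (ParallelMLP, A_shard, B_shard) are illustrative, not Megatron-LM's actual code.

```python
import torch
import torch.distributed as dist


class CopyToParallelRegion(torch.autograd.Function):
    """The 'f' operator: identity in the forward pass, all-reduce of gradients in backward."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad):
        dist.all_reduce(grad)  # sum input gradients coming from every shard
        return grad


class ReduceFromParallelRegion(torch.autograd.Function):
    """The 'g' operator: all-reduce in the forward pass, identity in backward."""
    @staticmethod
    def forward(ctx, x):
        dist.all_reduce(x)  # sum the partial outputs produced by every shard
        return x

    @staticmethod
    def backward(ctx, grad):
        return grad


class ParallelMLP(torch.nn.Module):
    def __init__(self, hidden, ffn_hidden, world_size):
        super().__init__()
        # A is split column-wise: each rank owns ffn_hidden // world_size output columns.
        self.A_shard = torch.nn.Linear(hidden, ffn_hidden // world_size)
        # B is split row-wise: each rank owns the matching input rows of B.
        # Bias omitted so it is not added world_size times after the all-reduce.
        self.B_shard = torch.nn.Linear(ffn_hidden // world_size, hidden, bias=False)

    def forward(self, x):
        x = CopyToParallelRegion.apply(x)               # f
        y = torch.nn.functional.gelu(self.A_shard(x))   # Y_i = GeLU(X A_i)
        z = self.B_shard(y)                             # Z_i = Y_i B_i (partial result)
        return ReduceFromParallelRegion.apply(z)        # g: Z = sum_i Z_i
```

Only one all-reduce is needed in each direction per MLP block, which keeps the communication cost of this partition low.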
5. Related Works: Operation Partition
Megatron-LM (2019, Nvidia)
Model partition over self-attention heads
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, Nvidia
Each accelerator is assigned specific attention heads
and computes the attention for those heads
f and g perform all-reduces in the backward/forward passes, respectively
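In the same spirit, a minimal sketch of head-partitioned self-attention, assuming each rank holds heads_per_rank of the heads together with matching QKV and output-projection weight shards. Names are illustrative; the causal mask and the f/g autograd wrappers are omitted for brevity.

```python
import torch
import torch.distributed as dist


def parallel_self_attention(x, qkv_shard, out_shard, heads_per_rank, head_dim):
    """x: [batch, seq, hidden]; qkv_shard, out_shard: this rank's weight shards."""
    b, s, _ = x.shape
    # Column-parallel QKV projection: each rank projects only onto its own heads.
    qkv = x @ qkv_shard                                   # [b, s, 3 * heads_per_rank * head_dim]
    q, k, v = qkv.chunk(3, dim=-1)
    q = q.view(b, s, heads_per_rank, head_dim).transpose(1, 2)
    k = k.view(b, s, heads_per_rank, head_dim).transpose(1, 2)
    v = v.view(b, s, heads_per_rank, head_dim).transpose(1, 2)
    # Attention is computed entirely locally, head by head (causal mask omitted).
    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5  # [b, heads_per_rank, s, s]
    ctx = torch.softmax(scores, dim=-1) @ v               # [b, heads_per_rank, s, head_dim]
    ctx = ctx.transpose(1, 2).reshape(b, s, heads_per_rank * head_dim)
    # Row-parallel output projection followed by the 'g' all-reduce across ranks.
    partial = ctx @ out_shard
    dist.all_reduce(partial)
    return partial
```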
6. Related Works: Pipeline Partition
GPipe (2019, Google)
Layer-wise pipeline
A mini-batch is split into micro-batches
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
Pipeline bubbles (idle time) occur;
micro-batching reduces them, but they do not disappear entirely
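To make the bubble concrete, here is a toy sketch (not GPipe's actual API): at clock step t, stage k works on micro-batch t - k in the forward direction, and the "--" cells are the idle bubble.

```python
def pipeline_schedule(num_stages, num_microbatches):
    """Print a forward-only GPipe-style schedule; '--' marks an idle (bubble) slot."""
    total_steps = num_microbatches + num_stages - 1
    for t in range(total_steps):
        row = []
        for k in range(num_stages):
            m = t - k  # micro-batch index handled by stage k at clock step t
            row.append(f"F{m}" if 0 <= m < num_microbatches else "--")
        print(" ".join(row))


pipeline_schedule(num_stages=4, num_microbatches=8)
```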
7. Related Works: Pipeline Partition
As shown below, bubbles remain in the schedule, yet overall training time is reduced.
L = number of layers, Pipeline-K (K = number of accelerators)
hdim=8096, head=32, batch=32
Maximum model size grows 298× compared to a single TPU
K = number of TPUs, M = number of micro-batches per mini-batch
L=32
With 4 TPUs and the mini-batch split into 4 micro-batches:
3.2× speed-up
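The trend behind these numbers can be approximated with the usual pipeline-bubble formulas; the measured 3.2× differs from the ideal because of re-materialization and communication overheads, so this is only a rough sanity check.

```python
def bubble_fraction(K, M):
    """Fraction of the pipeline that sits idle with K stages and M micro-batches."""
    return (K - 1) / (M + K - 1)


def ideal_speedup(K, M):
    """Idealized speedup over a single device, ignoring communication and recompute."""
    return K * M / (M + K - 1)


for M in (1, 4, 32):
    print(f"K=4, M={M:2d}: bubble={bubble_fraction(4, M):.2f}, "
          f"ideal speedup={ideal_speedup(4, M):.2f}")
```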
15. Conclusion
TeraPipe: a high-performance token-level pipeline-parallel algorithm for
training large-scale Transformer LMs
The optimal pipeline execution scheme is found by dynamic programming (DP); a sketch follows this slide
TeraPipe is orthogonal to other model parallel training methods and
can be combined with them
TeraPipe accelerates the synchronous training of the largest GPT-3
models with 175 billion parameters by 5.0x compared to previous
methods.
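The "optimal pipeline execution scheme by DP" can be illustrated with a simplified sketch. Here slice_time(i, j) stands for an assumed, profiled cost of one pipeline stage processing tokens i..j-1, and the pipeline latency with K stages is approximated as the sum of slice times plus (K - 1) times the largest slice time, as in the paper. The enumeration over the maximum slice time and the prefix DP follow the paper's idea, but this is not the authors' implementation.

```python
def optimal_slicing(seq_len, num_stages, slice_time):
    """Pick token-slice boundaries minimizing sum(t_i) + (K - 1) * max(t_i).

    Unoptimized O(L^4) sketch: enumerate a candidate maximum slice time, then run
    a prefix DP that minimizes the total slice time under that cap.
    """
    candidates = {slice_time(i, j) for i in range(seq_len) for j in range(i + 1, seq_len + 1)}
    best_latency, best_slices = float("inf"), None
    for t_max in candidates:
        INF = float("inf")
        total = [INF] * (seq_len + 1)   # total[n]: min sum of slice times covering tokens [0, n)
        cut = [0] * (seq_len + 1)       # cut[n]: start index of the last slice in that optimum
        total[0] = 0.0
        for n in range(1, seq_len + 1):
            for i in range(n):
                t = slice_time(i, n)
                if t <= t_max and total[i] + t < total[n]:
                    total[n], cut[n] = total[i] + t, i
        if total[seq_len] == INF:
            continue  # no slicing fits under this cap
        latency = total[seq_len] + (num_stages - 1) * t_max
        if latency < best_latency:
            slices, n = [], seq_len
            while n > 0:                # walk the cut pointers back to recover boundaries
                slices.append((cut[n], n))
                n = cut[n]
            best_latency, best_slices = latency, list(reversed(slices))
    return best_latency, best_slices


# Toy usage: later tokens attend over a longer prefix, so they cost a bit more.
latency, slices = optimal_slicing(seq_len=8, num_stages=3,
                                  slice_time=lambda i, j: (j - i) * (1.0 + 0.1 * j))
print(latency, slices)
```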