[Paper review] TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models (Berkeley, 2021)
Review by Seong Hoon Jung
hoondori@gmail.com
2021.03
Apache Spark
Ray
Motivation
• Transformer-based LMs have grown substantially in model size, owing to
the fact that they can be trained without supervision
on almost unlimited text data
• GPT-3 has more than 175B parameters, which amounts
to 350 GB in 16-bit format
• This significantly exceeds the memory capacity of existing
hardware accelerators, such as GPUs and TPUs, which makes
model-parallel training a necessity
Review: Transformer Architecture
Left-to-right attention,
Multi-head
Roughly, the MLP block is X * A * B, i.e. two consecutive matrix multiplications.
Related Works – Operation Partition
• Megatron-LM (2019, Nvidia)
• Model partition of NLP's matrix multiplications
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, Nvidia
In the MLP, the first weight matrix A is partitioned column-wise
and the second weight matrix B is partitioned row-wise.
Each partition is assigned to its own accelerator (GPU or TPU),
so cross-device communication is needed,
e.g. all-reduce:
concretely, the conjugate operators f and g insert the all-reduce ops,
f as an all-reduce in the backward pass,
g as an all-reduce in the forward pass.
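To make the column/row split concrete, here is a minimal NumPy sketch; the two-way split and the plain sum standing in for the all-reduce are illustrative assumptions, not Megatron-LM's code:

import numpy as np

# Toy shapes: X is [batch, d], A is [d, 4d], B is [4d, d]
d = 8
X = np.random.randn(4, d)
A = np.random.randn(d, 4 * d)
B = np.random.randn(4 * d, d)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Reference: single-device MLP, Y = GeLU(X A) B
Y_ref = gelu(X @ A) @ B

# Megatron-style partition over two "devices": A column-wise, B row-wise.
A_parts = np.split(A, 2, axis=1)   # each [d, 2d]
B_parts = np.split(B, 2, axis=0)   # each [2d, d]

# Each device computes GeLU(X A_i) B_i independently; the g operator
# all-reduces (sums) the partial outputs in the forward pass.
partials = [gelu(X @ A_i) @ B_i for A_i, B_i in zip(A_parts, B_parts)]
Y = sum(partials)                  # stands in for the all-reduce

print(np.allclose(Y, Y_ref))       # True: the partition preserves the result

The column split of A is what lets GeLU be applied independently on each device; the row split of B is what turns the recombination into a single sum, i.e. the forward all-reduce of g.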
Related Works – Operation Partition
• Megatron-LM (2019, Nvidia)
• Model partition over the self-attention heads
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, Nvidia
Each accelerator computes the attention for its own subset of heads;
f and g again perform the forward/backward all-reduce.
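A similarly minimal sketch for the head partition (illustrative NumPy; two heads stand in for two devices, and the projection shapes are assumptions):

import numpy as np

def causal_attention(X, Wq, Wk, Wv):
    # Single-head scaled dot-product attention with a causal (left-to-right) mask.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ V

seq, d, heads, dh = 6, 8, 2, 4
X = np.random.randn(seq, d)
# One (Wq, Wk, Wv) triple per head, plus the matching rows of the output projection.
Wq = [np.random.randn(d, dh) for _ in range(heads)]
Wk = [np.random.randn(d, dh) for _ in range(heads)]
Wv = [np.random.randn(d, dh) for _ in range(heads)]
Wo = [np.random.randn(dh, d) for _ in range(heads)]

# Each "device" owns some heads and computes its partial output;
# the forward all-reduce (g) sums the partial outputs.
partials = [causal_attention(X, Wq[h], Wk[h], Wv[h]) @ Wo[h] for h in range(heads)]
Y = sum(partials)   # stands in for the all-reduce over devices

Because each device owns whole heads plus the matching rows of the output projection, the only communication needed is the same f/g all-reduce pair as in the MLP.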
Related Works – Pipeline Partition
• GPipe (2019, Google)
• Layer-wise pipeline
• A mini-batch is split into micro-batches
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
Pipeline bubbles (idle slots) appear;
splitting the mini-batch into micro-batches reduces them.
Related Works – Pipeline Partition
In practice, despite the bubble overhead, training time still drops.
L = Layer, Pipeline-K (K = number of GPUs)
hdim=8096, head=32, batch=32
• Compared to a single TPU, models up to 298x larger
can be trained
K: number of TPUs, M = # of micro-batches in a mini-batch
L=32
With 4 TPUs and the mini-batch split into 4 micro-batches:
3.2x speed-up
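As a rough sanity check on these numbers, here is the usual idealized pipeline accounting (a sketch assuming perfectly balanced stages, not the paper's measurement):

# With K pipeline stages and M micro-batches, the idle ("bubble") fraction is
# roughly (K - 1) / (M + K - 1), so the speed-up over one device is about:
def pipeline_speedup(K, M):
    return K * M / (M + K - 1)

print(pipeline_speedup(K=4, M=4))    # ~2.3x under this idealization
print(pipeline_speedup(K=4, M=32))   # ~3.7x: more micro-batches shrink the bubble

Measured speed-ups such as the 3.2x above also depend on load balance and re-materialization, so they need not match the idealized formula exactly.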
Problems with micro-batch processing
• As the maximum sequence length grows, memory pressure leaves
no option but to reduce the batch size.
Roughly, what fits in memory ~ sequence length x batch size,
so doubling the sequence length means halving the batch size.
Fewer micro-batches means the bubbles grow back.
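A toy illustration of this constraint (the budget number is made up purely to show the scaling):

# Activation memory grows roughly with batch_size * seq_len, so for a fixed
# budget the usable batch shrinks as the sequence gets longer.
budget_tokens = 32 * 2048            # assume batch 32 fits at sequence length 2048
for seq_len in (2048, 4096, 8192):
    max_batch = budget_tokens // seq_len
    print(seq_len, max_batch)        # 2048 -> 32, 4096 -> 16, 8192 -> 8

At 8192 tokens only 8 sequences fit in this toy budget, leaving little room for micro-batching.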
TeraPipe: Token-based pipeline
Per sequence: computed independently
Per token: computed independently (for the FF layers)
The hidden state at position t does not depend on position t+1:
the FF layers operate on each token separately anyway,
and with left-to-right attention a later token depends only on the
tokens before it, so it can be processed as soon as those are done,
in a pipelined fashion.
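A small scheduling sketch of this idea (unit per-layer work, uniform token slices, and one pipeline stage per layer are simplifying assumptions, not TeraPipe's implementation):

def schedule(num_layers, num_slices):
    # Makespan of a toy token-sliced pipeline: each layer's total work is
    # normalized to 1 unit and split evenly across the token slices.
    cost = 1.0 / num_slices
    finish = {}
    for l in range(num_layers):              # one pipeline stage (device) per layer
        for s in range(num_slices):
            # cell (l, s) waits for this stage's previous slice and for the
            # previous layer to have produced everything up to slice s (causality)
            start = max(finish.get((l, s - 1), 0.0), finish.get((l - 1, s), 0.0))
            finish[(l, s)] = start + cost
    return finish[(num_layers - 1, num_slices - 1)]

print(schedule(4, 1))   # 4.0   : no token slicing, stages run strictly one after another
print(schedule(4, 8))   # 1.375 : later layers start on early slices while earlier layers continue

The 1.375 vs 4.0 gap is the bubble reduction: later layers begin on the first token slices while earlier layers are still working on the rest of the sequence.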
The bubbles shrink.
If anything, longer sequences now become an advantage,
since there are more tokens to slice and pipeline over.
Splitting the sequence into uniform slices is inefficient.
Because of the left-to-right self-attention,
the amount of computation grows toward the end of the sequence.
For example, if a 16-token sequence is split into 4 slices,
slice t1 attends over only 4 tokens,
while t2 attends over 8, t3 over 12, and t4 over all 16.
=> The bubbles grow again.
Splitting the sequence evenly is not the answer;
splitting it unevenly, so that the slices take similar time, is the key point.
• In the end: derive the slicing scheme that minimizes the total execution time T.
Knapsack-like problem with a fixed t_max => Dynamic Programming
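A minimal sketch of that dynamic program. The per-slice cost function slice_cost and the makespan estimate sum + (K-1) * t_max below are illustrative assumptions, not TeraPipe's actual cost model:

def slice_cost(start, length, a=1.0, b=0.01):
    # FF work ~ length; attention work ~ length * (start + length), since each
    # of the `length` tokens attends over the whole prefix up to its position.
    return a * length + b * length * (start + length)

def best_slicing(seq_len, num_stages):
    # Try each candidate cap t_max on the per-slice time; for a fixed t_max the
    # subproblem (cover the sequence with slices no costlier than t_max while
    # minimizing the summed slice time) is solved by a prefix DP.
    candidates = sorted({slice_cost(s, e - s)
                         for s in range(seq_len) for e in range(s + 1, seq_len + 1)})
    best_latency, best_slices = float("inf"), None
    for t_max in candidates:
        dp = [float("inf")] * (seq_len + 1)   # dp[i]: min summed time to cover tokens [0, i)
        cut = [0] * (seq_len + 1)
        dp[0] = 0.0
        for i in range(1, seq_len + 1):
            for j in range(i):
                c = slice_cost(j, i - j)
                if c <= t_max and dp[j] + c < dp[i]:
                    dp[i], cut[i] = dp[j] + c, j
        if dp[seq_len] == float("inf"):
            continue
        latency = dp[seq_len] + (num_stages - 1) * t_max   # rough pipeline makespan
        if latency < best_latency:
            bounds, i = [], seq_len
            while i > 0:
                bounds.append((cut[i], i))
                i = cut[i]
            best_latency, best_slices = latency, bounds[::-1]
    return best_latency, best_slices

latency, slices = best_slicing(seq_len=32, num_stages=4)
print([end - start for start, end in slices])   # later slices tend to be shorter

The cap t_max forces later slices to be shorter, since a slice's cost grows with its start position; the outer loop then picks the cap that best trades slice granularity against the (K-1) * t_max pipeline fill.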
Combining the kinds of parallelism
Data parallel
: split the batch, with the model replicated
Pipeline parallel
(micro-batch, token): split by layer, then by batch and by tokens
Operation parallel: split the matrix multiplications
c.f.) Switch Transformer's
expert parallelism
The larger the model
= the smaller the batch
= the weaker the effect of data parallelism
= the larger the micro-batch pipeline bubble
=> token-level parallelism becomes the important dimension
(2), (3): keeping the rest of the configuration fixed, adding token-level parallelism gives a speed-up
(2), (3): the decisive cases are where data parallelism gives way to token-level parallelism,
and the benefit is largest
for the largest models
Cutting the sequence into as many slices as possible is not always better.
- e.g. splitting into 16 or more slices can actually bring the bubbles back
The slicing found by the dynamic program (which optimizes the slice boundaries) works best.
The longer the sequence,
the larger TeraPipe's speed-up over GPipe.
Conclusion
• TeraPipe: a high-performance token-level pipeline parallel algorithm for
training large-scale Transformer LMs
• Optimal pipeline execution scheme found by dynamic programming
• TeraPipe is orthogonal to other model-parallel training methods and
can be combined with them
• TeraPipe accelerates the synchronous training of the largest GPT-3
model, with 175 billion parameters, by 5.0x compared to previous
methods