�ݺ�ߣ

Review by Seong Hoon Jung
hoondori@gmail.com
2021.03
Apache spark
Ray

Motivation
• Transformer-based LM grows substantially with its model size,
attributing to the fact that they can be unsupervisedly trained
on almost unlimited text data
• GPT-3 can have more than 175B parameters, which amounts
to 350 GB(16-bit format)
• This significantly exceeds the memory capacity of existing
hardware accelerators, such as GPUs and TPUs, which makes
model-parallel training a necessity

Review: Transformer Architecture
Left-to-right attention,
Multi-head
간단하게는 X * A * B 처럼 2번의 행렬곱으로 이해하면 된다.

Related Works – Operation Partition
• Megatron-LM (2019, Nvidia)
• Model partition on NLP’s matrix multiplication
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, Nvidia
MLP에서 첫번째 레이어의 곱 A는 column-wise partition
두번째 레이어의 곱 B는 row-wise partition
각 partition마다 accelerator(GPU or TPU) 를 배치
Cross-device communication 유발
Ex. All-reduce
구현시에는 f, g 에 all-reduce ops를
추가하면 된다.
f는 backward의 all-reduce,
g는 forward의 all-reduce

Related Works – Operation Partition
• Megatron-LM (2019, Nvidia)
• Model partition on Self attention’s head
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, Nvidia
Accelerator가 특정 head에서의
Attention 을 계산
f,g에서 정방향/역방향의 all-reduce 담당

Related Works – Pipeline Partition
• Gpipe (2019, Google)
• Layer-wise pipeline
• A minibatch to microBatches
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
버블이 크다
버블이 더 작아진다.

Related Works – Pipeline Partition
실제로는 버블도 작고 학습시간도 줄어든다.
L=Layer수, Pipeline-K (GPU개수)
hdim=8096, head=32, batch=32
단일 TPU 사용대비 298배
큰 모델 학습 가능
K: TPU 개수, M= # of micro-batch in mini-batch
L=32
4 TPU 기준 mini-batch를 4등분하면
3.2 x speed-up

Microbatch 처리의 문제점
• 문장 최대길이(sequence length)가 길어지면 메모리 한계로 인
해서 배치크기가 작아질 수 밖에 없다.
메모리 fit = 문장 길이 x 배치 크기 라고 하면
문장 길이가 2배 늘어나면 배치크기는 반으로 줄여야 한다.
버블이 늘어난다

TeraPipe : Token-based pipeline
문장 기준
병렬 계산
Token 기준
병렬 계산
셀프 어텐션은 t+1에 의존하지 않음
FF는 자기 자신만 필요
미래 토큰들이 없어도 과거 토큰들이
이전 레이어에서 전달받았으면 처리할 수 있다.

버블이 줄어든다.
단 문장 길이가 꽤나 길어야
이런 효과가 나타난다고 한다.

문장을 균일하게 쪼개면 비효율적이다.
Left-right self-attention 특성상
문장 끝으로 가면서 처리시간이 길어진다.
예를 들어 16개 단어로 구성된 문장을 4등분
하면 t1에서는 4개 단어가 입력이지만,
T2에서는 8개 단어, t3에서는 12개 단어, t4에선
16개 단어가 입력이 된다.
=> 버블이 늘어난다.
문장을 균등하게 자르는 것이 아니라 소요시간이
비슷해지게 자르는 것이 더 효과적이다.
정방향 전파 총 소요시간 T를 최소화 할 수 있는 문장 자르기 scheme
도출
Knapsack problem with fixed t_max => Dynamic Programming

Parallel 조합
Data Parallel
: 배치 분할 with 모델 복제
Pipeline Parallel
(micro-batch, token) : 층분할, 배치 분할, Token 분할
Operation Parallel : 행렬곱의 분할
c.f) Swith Transformer
에서는 expert parallel

모델 크기 커짐
= Small Batch
= DataParallel 효용성 감소
= Micro-batch 버블 커짐
=> Token parallel 중요해짐
(2), (3) 을 제외하고는 모든 조합에서 token parallel로 인해서 speed up
(2), (3) 에서는 DP가 token parallel을 사용하지 않기로 결정한 경우
특히 Large모델에서는
진가를 발휘

Slice를 많이 쪼갠다고 마냥 시간이 줄어드는 것이 아니다.
- ex) 16개 이상 쪼개면 오히려 버블이 늘어나 속도 저하
DP(시간 균등을 목표로 최적화)로 찾은 분할 scheme이 적당했다.
문장 길이를 길게 하는 경우 Gpipe 보다 TeraPipe가
훨씬 Speed up 이 좋았다.

결론
• TeraPipe: high-performance token-level pipeline parallel algorithm for
training large-scale Transformer LM
• Optimal pipeline execution scheme by DP
• TeraPipe is orthogonal to other model parallel training methods and
can be combined with them
• TeraPipe accelerates the synchronous training of the largest GPT-3
models with 175 billion parameters by 5.0x compared to previous
methods.

�ݺ�ߣ

[Paper review] tera pipe: token level pipeline parallelism for training large-scale language models, berkely, 2021

More Related Content

[Paper review] tera pipe: token level pipeline parallelism for training large-scale language models, berkely, 2021