Switch Transformer
SCALING TO TRILLION PARAMETER MODELS
WITH SIMPLE AND EFFICIENT SPARSITY
2021.02
Review by Seong Hoon Jung
hoondori@gmail.com
Overview
• Motivation
  • More parameters generally mean better performance, but...
  • ...dense models have so far topped out around 13 billion parameters (T5-XXL).
  • Scaling further costs too much computation budget!
• Approach
  • Instead of growing the model dimensions, use a Mixture of Experts (MoE).
  • Parameters grow with the number of experts.
  • Each token is routed to a single expert, so FLOPs stay constant.
• Results
  • (+) 10x more parameters, better sample efficiency, faster training, same FLOPs
  • (-) training instability and much larger memory requirements for training/serving
T5 : Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Motivation
• What determines the performance of a language model?
  • Model size, dataset size, and compute budget
Scaling Laws for Neural Language Models (Kaplan, 2020)
increase the parameter count while keeping
the floating point operations (FLOPs) per example constant.
We achieve this by designing a sparsely activated model that efficiently uses
hardware designed for dense matrix multiplications such as GPUs and TPUs.
The single FF layer of a standard Transformer block is replaced by N expert FF layers.
Each (attended) token representation is processed by one FF layer.
Ex) For "I like movie":
"I" => FF
"Like" => FF
"movie" => FF
• Each token is sent to the expert FF layer with the highest router probability (argmax).
Because every token is processed by a single expert, FLOPs per token stay constant.
As a result, different experts end up handling different subsets of tokens.
This is the Mixture-of-Experts idea: a sparsely activated model (a minimal code sketch follows).
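The sketch below shows this top-1 (switch) routing for a single MoE layer, just to make the per-token dispatch concrete. It is a minimal PyTorch illustration under assumed names (router, experts, d_model, num_experts), not the paper's actual implementation (which is written in Mesh TensorFlow).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_tokens, d_model, d_ff, num_experts = 6, 8, 16, 4

router = torch.nn.Linear(d_model, num_experts)       # router weights
experts = torch.nn.ModuleList([
    torch.nn.Sequential(torch.nn.Linear(d_model, d_ff),
                        torch.nn.ReLU(),
                        torch.nn.Linear(d_ff, d_model))
    for _ in range(num_experts)
])

x = torch.randn(num_tokens, d_model)                  # attended token representations
probs = F.softmax(router(x), dim=-1)                  # router probabilities per token
gate, expert_idx = probs.max(dim=-1)                  # top-1: argmax expert per token

y = torch.zeros_like(x)
for e in range(num_experts):
    mask = expert_idx == e
    if mask.any():
        # each token visits exactly one expert FFN, so FLOPs per token stay constant;
        # the expert output is scaled by its router probability (the gate value)
        y[mask] = gate[mask].unsqueeze(-1) * experts[e](x[mask])
```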
Mixture of Expert Routing
The sparsely-gated mixture-of-experts layer, Shazeer, 2017
In Shazeer's formulation, each token is routed to its top-2 (topK=2) experts and the
outputs of those expert layers are combined into the token representation.
The Switch layer instead uses topK=1 (a single expert):
- For the same FLOPs
- For lower communication cost (device to device)
- For better performance
Scaling Property of Switch Transformer
Quality improves as the number of experts grows, and at the same FLOPs the Switch Tx is more sample-efficient than the dense Tx.
Data / Model / Expert Parallelism
[Diagram: how weights (W), data shards (D), layers (L1-L4), and experts (E1, E2) are split across devices; gradients are aggregated across the data-parallel replicas]
• With data parallelism, every device holds the full model (all experts) and only the data is sharded for training/inference, much like Hadoop-style processing.
• When the experts do not fit on a single machine, they must be sharded across devices (expert/model parallelism).
Upstream (pre-training) results
• A masked language modeling task
• The model is trained to predict missing tokens.
Unlike Switch, the MoE baseline routes each token to 2 experts (top-2).
Both Switch and MoE use 128 experts.
Switch Transformers outperform both
carefully tuned dense models and MoE
Transformers on a speed-quality basis
Multilingual pre-training results
mT5-base
Downstream (fine-tuning) results
Switch improves quality on most downstream tasks while keeping prediction FLOPs the same.
Toward Trillion Parameter Model
Switch-XXL
- FLOP-matched to T5-XXL (same FLOPs per token, far more parameters)
- Num heads and num layers are too large for single-machine fitting, so expert + model parallelism is required
- Hence additional communication cost.
Switch-C: ~1 trillion parameters
- Designed for expert parallelism only (num heads and num layers are small enough to fit on a single machine)
- Instead, the number of experts is scaled up drastically (64 => 2048)
- The other dimensions are reduced, so FLOPs are about 1/10 (of Switch-XXL)
The Switch Tx is too large to serve => Distillation!
Distillation experiment with a pre-trained Switch-Base as the teacher and T5-Base as the student.
Compared with a T5-Base trained from scratch, the model distilled from Switch-Base keeps about 30% of the teacher's quality gain at the same dense model size.
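A hedged sketch of such a distillation objective: the student's loss mixes the usual hard-label cross-entropy with a soft-target term against the teacher's distribution. The 0.25 hard / 0.75 soft split mirrors the mixture the paper reports as best; the helper name distill_loss is illustrative, not from their code.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, hard_weight=0.25):
    """Mix hard-label cross-entropy with KL to the teacher's soft targets."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
    return hard_weight * hard + (1.0 - hard_weight) * soft

# toy usage: 4 tokens, vocabulary of 10
s = torch.randn(4, 10, requires_grad=True)
t = torch.randn(4, 10)
y = torch.randint(0, 10, (4,))
print(distill_loss(s, t, y))
```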
Remaining topics
Balancing the routing load across experts
Heuristics (tips) for stabilizing training
Alternative ideas that were explored
Expert Routing Balancing
• We want the work to be spread evenly across the experts.
In the example, Expert 3 receives fewer tokens than its capacity (capacity = 2) in this batch,
while Expert 1 receives more tokens than its capacity (capacity = 2); the excess tokens cannot be processed => overflow.
To avoid overflow we could raise the capacity factor globally,
but the extra slack wastes memory and leaves some experts under-utilized (see the capacity sketch below).
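A tiny sketch of how the per-expert token budget follows from the capacity factor. The function name and the 1.25 default are illustrative; the paper sweeps several capacity factors.

```python
def expert_capacity(tokens_per_batch: int, num_experts: int,
                    capacity_factor: float = 1.25) -> int:
    # capacity_factor > 1.0 leaves headroom for imbalanced routing,
    # at the cost of extra padded memory; tokens beyond this budget overflow
    return int((tokens_per_batch / num_experts) * capacity_factor)

# e.g. 1024 tokens routed over 8 experts with factor 1.25 -> 160 slots per expert
print(expert_capacity(1024, 8, 1.25))
```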
Expert Routing Balancing
• An auxiliary loss is added to encourage balanced routing: loss = alpha * N * sum_i (f_i * P_i)
f_i: the fraction of tokens in the batch dispatched to expert i
Ex) out of 100 tokens in the batch, how many were assigned to expert i
N: the number of experts
P_i: the mean router probability assigned to expert i over the batch
Ex) if 'hello' gets probability 0.3 and 'music' gets 0.6 for expert i, the sum is 0.9
• The loss is minimized by a uniform routing distribution across experts (sketch below).
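A minimal sketch of that auxiliary loss, assuming the per-token router probabilities and argmax assignments from the routing sketch above; alpha = 1e-2 is the coefficient the paper uses.

```python
import torch
import torch.nn.functional as F

def load_balance_loss(router_probs, expert_idx, num_experts, alpha=1e-2):
    # f_i: fraction of tokens whose argmax landed on expert i
    f = F.one_hot(expert_idx, num_experts).float().mean(dim=0)
    # P_i: mean router probability mass given to expert i over the batch
    p = router_probs.mean(dim=0)
    # minimized when both are uniform (1/N), i.e. balanced routing
    return alpha * num_experts * torch.sum(f * p)
```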
Tips for stabilizing training
• Smaller parameter initialization for stability
  • Sparse MoE models are prone to training instability, with large variance early in training.
  • Using the Transformer's usual truncated-normal initializer but with its variance (scale) reduced by a factor of 10 makes training noticeably more stable.
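A sketch of that reduced initialization, assuming weights drawn from a truncated normal with std = sqrt(s / fan_in); lowering s from 1.0 to 0.1 is the 10x reduction described above. trunc_normal_ is PyTorch's truncated-normal initializer; the helper name is made up.

```python
import math
import torch

def init_linear(weight: torch.Tensor, scale: float = 0.1):
    """Truncated-normal init with std = sqrt(scale / fan_in);
    scale=0.1 is 10x smaller than the usual scale of 1.0."""
    fan_in = weight.shape[1]
    std = math.sqrt(scale / fan_in)
    torch.nn.init.trunc_normal_(weight, mean=0.0, std=std, a=-2 * std, b=2 * std)

layer = torch.nn.Linear(512, 2048)
init_linear(layer.weight)
```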
Tips for stabilizing training
• Large expert dropout
  • Dropout at non-expert layers < dropout at expert layers
  • Ex) 0.1 < 0.4
Tips for stabilizing training
• Selective precision with large sparse models
  • Compute everything in float32: training is stable, but slow (expensive communication).
  • Compute everything in bfloat16: fast, but training becomes unstable.
  • Only inside the expert/router computation, bfloat16 values are cast to float32.
  • Everything else stays in bfloat16.
  • (+) lower device-to-device communication cost (compared to float32 everywhere)
  • (+) numerical stability inside the expert layer
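A minimal sketch of the selective-precision idea: keep tensors in bfloat16 (cheap to communicate), but do the numerically sensitive router softmax in float32 locally and cast back before dispatch. The function name is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def route_probs(router_logits_bf16: torch.Tensor) -> torch.Tensor:
    # upcast locally: the softmax/exponentiation is the numerically unstable part
    probs_fp32 = F.softmax(router_logits_bf16.float(), dim=-1)
    # cast back so only bfloat16 tensors cross device boundaries
    return probs_fp32.to(torch.bfloat16)

logits = torch.randn(6, 4, dtype=torch.bfloat16)
print(route_probs(logits).dtype)   # torch.bfloat16
```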
What else did they try?
• Switching self-attention
  • Apply the MoE/switch routing to the attention layers as well.
What else did they try?
• No-Token-Left-Behind
  • By default, tokens that overflow the expert capacity are not processed and simply pass to the next layer (effectively through the skip connection).
  • A token assigned to Expert 1 that overflows is, in a second stage, re-assigned to its second-choice expert (sketch below).
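A hedged sketch of that re-assignment step: tokens whose first-choice expert is full fall back to their second choice. The capacity bookkeeping is simplified to a sequential loop for clarity, not how a real vectorized dispatcher would do it.

```python
import torch

def reroute(router_probs: torch.Tensor, capacity: int) -> torch.Tensor:
    """Assign each token to its top-1 expert, falling back to the top-2 expert
    when the first choice is already at capacity (simplified, sequential)."""
    top2 = router_probs.topk(2, dim=-1).indices        # [num_tokens, 2]
    assign = top2[:, 0].clone()
    counts = torch.zeros(router_probs.shape[1], dtype=torch.long)
    for t in range(assign.shape[0]):
        first, second = top2[t, 0].item(), top2[t, 1].item()
        e = first if counts[first] < capacity else second
        assign[t] = e
        counts[e] += 1
    return assign

probs = torch.softmax(torch.randn(8, 4), dim=-1)
print(reroute(probs, capacity=2))
```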
What else did they try?
• Exploration across experts
  • Is it always best to assign a token only to its top-1 expert?
  • From an exploit-vs-explore perspective, greedy argmax routing may be sub-optimal.
