Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
https://arxiv.org/abs/2101.03961
Switch Transformers paper review
1. Switch Transformer
SCALING TO TRILLION PARAMETER MODELS
WITH SIMPLE AND EFFICIENT SPARSITY
2021.02
Review by Seong Hoon Jung
hoondori@gmail.com
3. Motivation
• The larger the model, the better the performance
• Scaling is governed by model size, dataset size, and compute budget (Budget)
Scaling Laws for Neural Language Models (Kaplan, 2020)
increase the parameter count while keeping
the floating point operations (FLOPs) per example constant.
We achieve this by designing a sparsely activated model that efficiently uses
hardware designed for dense matrix multiplications such as GPUs and TPUs.
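To make the "constant FLOPs per example" idea concrete, here is a minimal NumPy sketch of top-1 ("switch") routing. This is not the authors' Mesh-TensorFlow code; the sizes and names (router_w, experts_w1, experts_w2, switch_ffn) are made up for illustration. Adding experts grows the parameter count, but each token still passes through exactly one expert FFN, so per-token compute stays roughly constant.

```python
# Minimal sketch of top-1 "switch" routing over expert FFNs (illustrative only).
import numpy as np

d_model, d_ff, num_experts = 8, 32, 4                       # toy sizes, not from the paper
rng = np.random.default_rng(0)

router_w = rng.normal(size=(d_model, num_experts))          # router: d_model -> num_experts
experts_w1 = rng.normal(size=(num_experts, d_model, d_ff))  # each expert is a small FFN
experts_w2 = rng.normal(size=(num_experts, d_ff, d_model))

def switch_ffn(tokens):
    """tokens: (n_tokens, d_model). Each token is sent to its single best expert."""
    logits = tokens @ router_w                               # (n_tokens, num_experts)
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    expert_idx = probs.argmax(-1)                            # top-1 routing decision
    out = np.zeros_like(tokens)
    for e in range(num_experts):
        sel = expert_idx == e
        if sel.any():
            h = np.maximum(tokens[sel] @ experts_w1[e], 0.0) # ReLU FFN for expert e
            # gate value scales the expert output
            out[sel] = (h @ experts_w2[e]) * probs[sel, e:e+1]
    return out

print(switch_ffn(rng.normal(size=(16, d_model))).shape)     # (16, 8)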
7. Data / Model / Expert Parallelism
[Diagram: data parallelism replicates weights (W) across devices and aggregates gradients over data shards (D); model parallelism splits layers (L1, L2, L3, L4) across devices; expert parallelism places experts (E1, E2) on different devices]
• Data is routed to whichever machine hosts the responsible expert, much like a Hadoop-style shuffle
• If a single expert does not fit on one machine, its weights can be sharded further (see the sketch below)
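A rough sketch of the expert-parallel dispatch described above, under assumed toy sizes (this is not the paper's Mesh-TensorFlow implementation): tokens are grouped by their routed expert, shipped to the "device" that owns that expert, processed, and shuffled back, with a fixed capacity per expert dropping the overflow as Switch does.

```python
# Expert parallelism as an all-to-all style shuffle (illustrative sketch only).
import numpy as np

num_experts, capacity = 4, 3                        # illustrative values
rng = np.random.default_rng(1)
tokens = rng.normal(size=(10, 8))                   # 10 tokens, d_model = 8
expert_idx = rng.integers(0, num_experts, size=10)  # pretend router output (top-1 choice)

output = np.zeros_like(tokens)
for e in range(num_experts):                        # "device e" processes its expert's shard
    positions = np.flatnonzero(expert_idx == e)[:capacity]  # tokens over capacity are dropped
    shard = tokens[positions]                       # dispatch: tokens sent to device e
    processed = shard * 2.0                         # stand-in for the expert FFN on device e
    output[positions] = processed                   # combine: results shuffled back

print(output.shape)  # (10, 8); dropped tokens keep zeros (the residual path carries them)
```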
8. Upstream (pre-training) performance comparison
• A masked language modeling task
• The model is trained to predict missing tokens
• Unlike Switch (top-1 routing), the MoE baseline routes each token to expert=2 (top-2); see the sketch after this slide
• Both Switch and MoE use 128 experts
Switch Transformers outperform both
carefully tuned dense models and MoE
Transformers on a speed-quality basis
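For the routing difference mentioned above, a small illustrative sketch (the helper `route` is hypothetical, not from the paper): MoE-style top-2 routing sends each token through two experts, while Switch's top-1 routing uses one, roughly halving per-token expert compute at the same expert count.

```python
# Top-1 (Switch) vs top-2 (MoE baseline) routing decisions (illustrative sketch only).
import numpy as np

def route(logits, k):
    """Return the k chosen expert ids per token and gate weights renormalized over them
    (a simplification; Switch itself uses the full-softmax probability as the gate)."""
    top = np.argsort(-logits, axis=-1)[:, :k]                 # top-k expert indices
    gates = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(gates) / np.exp(gates).sum(-1, keepdims=True)
    return top, gates

num_experts = 128                                   # as in the Switch/MoE comparison on the slide
logits = np.random.default_rng(2).normal(size=(4, num_experts))

switch_experts, _ = route(logits, k=1)              # Switch: one expert per token
moe_experts, _ = route(logits, k=2)                 # MoE baseline: two experts per token
print(switch_experts.shape, moe_experts.shape)      # (4, 1) (4, 2)
```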