The document summarizes a research paper that compares the performance of MLP-based models to Transformer-based models on various natural language processing and computer vision tasks. The key points are:
1. Gated MLP (gMLP) architectures can achieve performance comparable to Transformers on most tasks, demonstrating that attention mechanisms may not be strictly necessary.
2. However, attention still provides benefits for some NLP tasks, as models combining gMLP and attention outperformed pure gMLP models on certain benchmarks.
3. For computer vision, gMLP achieved results close to Vision Transformers and CNNs on image classification, indicating gMLP can match their data efficiency.
The document discusses hyperparameter optimization in machine learning models. It introduces various hyperparameters that can affect model performance, and notes that as models become more complex, the number of hyperparameters increases, making manual tuning difficult. It formulates hyperparameter optimization as a black-box optimization problem to minimize validation loss and discusses challenges like high function evaluation costs and lack of gradient information.
This material mainly explains BERT; compared with the BERT coverage, XLNet and RoBERTa are not followed in as much detail.
Also, please note that my own figures read from top to bottom, while the figures taken from the papers read from bottom to top.
If you find any mistakes, please let me know and I will fix them.
(In particular, I am a little worried that I may have misread the English of the RoBERTa paper. Apologies in advance.)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
XLNet: Generalized Autoregressive Pretraining for Language Understanding
RoBERTa: A Robustly Optimized BERT Pretraining Approach
These slides were used by Umemoto, a member of our company, at an internal technical study session.
They explain the Transformer, an architecture that has been attracting attention in recent years.
"Arithmer Seminar" is held weekly; professionals from inside and outside our company give lectures on their respective areas of expertise.
The slides were made by a lecturer from outside our company and are shared here with his/her permission.
Arithmer Inc. is a mathematics company that began at the Graduate School of Mathematical Sciences of the University of Tokyo. We apply modern mathematics and AI systems to provide solutions to tough, complex problems across many fields. At Arithmer we believe it is our job to use AI well to improve work efficiency and to produce results that are useful to people and society.
【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces (Deep Learning JP)
This document summarizes a research paper on modeling long-range dependencies in sequence data using structured state space models and deep learning. The proposed S4 model (1) derives recurrent and convolutional representations of state space models, (2) improves long-term memory using HiPPO matrices, and (3) efficiently computes state space model convolution kernels. Experiments show S4 outperforms existing methods on various long-range dependency tasks, achieves fast and memory-efficient computation comparable to efficient Transformers, and performs competitively as a general sequence model.
This document discusses generative adversarial networks (GANs) and their relationship to reinforcement learning. It begins with an introduction to GANs, explaining how they can generate images without explicitly defining a probability distribution by using an adversarial training process. The second half discusses how GANs are related to actor-critic models and inverse reinforcement learning in reinforcement learning. It explains how GANs can be viewed as training a generator to fool a discriminator, similar to how policies are trained in reinforcement learning.
Paper introduction: Big Bird: Transformers for Longer Sequences (Toru Tamaki)
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed, Big Bird: Transformers for Longer Sequences, Advances in Neural Information Processing Systems 33 (NeurIPS 2020)
https://proceedings.neurips.cc/paper/2020/hash/c8512d142a2d849725f31a9a7a361ab9-Abstract.html
Paper introduction (ACL 2017): Get To The Point: Summarization with Pointer-Generator Networks (Masayoshi Kondo)
A research paper on the neural text summarization task, accepted as an ACL 2017 long paper; joint work between a PhD student in Christopher Manning's lab at Stanford and Google Brain. The model adds a mechanism that avoids repetition during generation for long (multi-sentence) inputs, making it possible to generate summaries of long documents. Slides for a paper introduction at a seminar. Paper URL: https://arxiv.org/abs/1704.04368
2. - This talk summarizes the following three papers:
"Neural Machine Translation by Jointly Learning to Align and Translate"
"Attention Is All You Need"
"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
- We look back at the origins and development of "Attention", which has become central to deep-learning-based NLP in recent years.
Introduction
3. Outline
"Neural Machine Translation by Jointly Learning to Align and Translate"
- Translation models based on LSTMs
- Attention + RNN
"Attention Is All You Need"
- Replacing the RNN with attention
- Self-attention and the Transformer
"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
- Pre-training: Masked LM and Next Sentence Prediction
- BERT's performance
10. - Do we even need an RNN in the first place?
- Computation is slow because the sequence has to be fed in step by step
- Training breaks down on long sentences (because of vanishing or exploding gradients)
- So let's replace the RNN with attention (a minimal sketch of the attention computation follows this slide)
- This led to the Transformer, proposed in "Attention Is All You Need"
From RNNs to Attention
https://adventuresinmachinelearning.com/recurrent-neural-networks-lstm-tutorial-tensorflow/
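To make the contrast with a sequential RNN concrete, here is a minimal sketch of scaled dot-product attention in PyTorch (my own illustration, not code from the slides): the whole sequence is handled by a few matrix multiplications, with no step-by-step recurrence. The function name, shapes, and toy sizes are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_model). Returns one output vector per position."""
    d_k = q.size(-1)
    # Every query is compared with every key in one matrix multiplication,
    # so all positions are processed at once rather than step by step.
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)          # attention weights over the sequence
    return torch.matmul(weights, v)              # weighted sum of the values

# Toy self-attention: q = k = v = the same sequence of 5 token vectors.
x = torch.randn(2, 5, 16)                        # (batch, seq_len, d_model)
print(scaled_dot_product_attention(x, x, x).shape)   # torch.Size([2, 5, 16])
```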
11. - Authors / affiliations
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
- Google Brain, Google Research, University of Toronto
- In one sentence
- Proposes the "Transformer", an encoder-decoder model in which the RNN is replaced by attention (see the encoder-decoder sketch after this slide). Achieves SoTA on many tasks despite a short training time.
Paper: "Attention Is All You Need"
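As one concrete way to see "an encoder-decoder model whose RNN is replaced by attention", the sketch below uses PyTorch's built-in nn.Transformer with small illustrative sizes; the shapes and hyperparameters here are assumptions for the example, not the configuration from the paper.

```python
import torch
import torch.nn as nn

# Attention-based encoder-decoder with no recurrence (toy sizes; the paper's
# base model uses d_model=512, 8 heads, and 6 layers on each side).
model = nn.Transformer(d_model=64, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(2, 10, 64)   # (batch, source length, d_model), already embedded
tgt = torch.randn(2, 7, 64)    # (batch, target length, d_model)
out = model(src, tgt)          # (2, 7, 64): one output vector per target position
print(out.shape)
```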
25. Summary of "Attention Is All You Need"
- We looked at attention with memory
- Introduced self-attention, where the input attends to itself, completing the Transformer, an architecture that removes the RNN entirely
- Computation can be parallelized
- Handles variable-length inputs well (one common way to do this, with a padding mask, is sketched after this slide)
Eliminating the RNN
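One standard way to handle variable-length inputs is to pad each batch to a common length and mask the padded key positions so the softmax ignores them. The sketch below is my own illustration of that idea (function name, shapes, and sizes are assumptions), not code from the paper.

```python
import torch
import torch.nn.functional as F

def masked_self_attention(x, lengths):
    """x: (batch, max_len, d_model); lengths: the true length of each sequence."""
    d = x.size(-1)
    scores = torch.matmul(x, x.transpose(-2, -1)) / d ** 0.5        # (batch, L, L)
    # Padding mask: True wherever a key position lies beyond the real length.
    positions = torch.arange(x.size(1))
    pad = positions[None, None, :] >= lengths[:, None, None]        # (batch, 1, L)
    scores = scores.masked_fill(pad, float("-inf"))                 # padded keys get zero weight
    return torch.matmul(F.softmax(scores, dim=-1), x)

# A batch of two sequences padded to length 6, with real lengths 6 and 3.
x = torch.randn(2, 6, 8)
print(masked_self_attention(x, torch.tensor([6, 3])).shape)         # torch.Size([2, 6, 8])
```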
26. - Authors / affiliations
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
- Google AI Language
- In one sentence
- A model that embeds a sentence into context-aware word representations using many stacked Transformer layers (a toy sketch of such a stack follows this slide)
Paper: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
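To illustrate "stacked Transformer layers that turn a sentence into context-aware word representations", the sketch below stacks PyTorch encoder layers on top of toy token embeddings. The vocabulary size, dimensions, and layer count are illustrative assumptions (BERT-base uses 12 layers with hidden size 768), and positional embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64                         # toy sizes, not BERT's configuration
embed = nn.Embedding(vocab_size, d_model)              # (positional embeddings omitted here)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)   # BERT-base stacks 12 such layers

token_ids = torch.randint(0, vocab_size, (2, 10))      # (batch, seq_len) of token ids
contextual = encoder(embed(token_ids))                 # (2, 10, 64): one vector per token,
print(contextual.shape)                                # conditioned on the whole sentence
```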
31. "For each task, we simply plug in the task-specific inputs and outputs into BERT and finetune all the parameters end-to-end."
Fine-tuning means re-training a pre-trained model for a specific task.
Example:
For classification, a [CLS] token is placed at the start of the input, and a small network on top of the BERT output at that position makes the prediction (a minimal sketch of such a head follows this slide).
Fine-tuning BERT
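A minimal sketch of that classification setup, under the assumption that `encoder` is any module mapping token ids to per-token hidden states (for example the toy stack above); the class `ClsHead`, the stand-in encoder, and the sizes are my own illustrative names, not BERT's actual interface.

```python
import torch
import torch.nn as nn

class ClsHead(nn.Module):
    """Predicts a class from the hidden state at the [CLS] position (index 0)."""
    def __init__(self, encoder, d_model, num_classes):
        super().__init__()
        self.encoder = encoder                    # any module: ids -> (batch, seq, d_model)
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)          # (batch, seq_len, d_model)
        cls_vec = hidden[:, 0]                    # output at the [CLS] position
        return self.fc(cls_vec)                   # (batch, num_classes) logits

# Stand-in for a pre-trained encoder; during fine-tuning, the task loss
# (e.g. cross-entropy on these logits) updates all parameters end-to-end.
toy_encoder = nn.Embedding(1000, 64)
model = ClsHead(toy_encoder, d_model=64, num_classes=2)
print(model(torch.randint(0, 1000, (2, 10))).shape)    # torch.Size([2, 2])
```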
34. References (other than the papers)
論文解説 Attention Is All You Need (Transformer)
- http://deeplearning.hatenablog.com/entry/transformer
作って理解する Transformer / Attention
- https://qiita.com/halhorn/items/c91497522be27bde17ce
The Illustrated Transformer
- https://jalammar.github.io/illustrated-transformer/
Neural Machine Translation with Attention
- https://www.tensorflow.org/beta/tutorials/text/nmt_with_attention
Transformer model for language understanding
- https://www.tensorflow.org/beta/tutorials/text/transformer
The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
- http://jalammar.github.io/illustrated-bert/
ゼロから作るDeep Learning② - 自然言語処理編 (Deep Learning from Scratch 2: Natural Language Processing)
- Koki Saitoh, 2018/07/21, O'Reilly Japan
#4: The concept of attention first appeared in Seq2Seq models.
That early attention was a little different from attention as we use it now; we look at what it was like at the time.
Attention Is All You Need (2017) came a little under two years after attention was invented for Seq2Seq.
We trace how attention evolved over that period.
Finally, we talk about BERT, a general-purpose pre-trained model for NLP.