An Introduction to ResNet and Its Derivative Research
2016-06-04
Masataka Nishimori

Main Points
● What is ResNet?
● What derivative research has come out of ResNet?
● Lessons learned from implementing it in TensorFlow

What Is ResNet?
● Overview
- Short for Deep Residual Network [1]
- Developed by MSRA; the winning algorithm of ImageNet (ILSVRC) 2015
- Introducing residuals reduces the performance degradation seen in very deep networks
- Very deep: 152 layers for ImageNet (earlier networks had around 20 layers)
[1]. He, Kaiming, et al. "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385 (2015).
Source: He, Kaiming, et al. "Identity mappings in deep residual networks." arXiv preprint arXiv:1603.05027 (2016).

How Deep Is It?
Source: Deep Residual Learning MSRA @ ILSVRC & COCO 2015 competitions
- Nearly 7 times as many layers as the 2014 winning algorithm.
- The paper also proposes networks with more than 1000 layers.

Is Deeper Always Better?
● At the very least, deeper appears to be better than wider. [1]
[1]. Eldan, Ronen, and Ohad Shamir. "The Power of Depth for Feedforward Neural Networks." arXiv preprint
arXiv:1512.03965 (2015).

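If You Simply Stack More Layers...
Source: He, Kaiming, et al. "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385 (2015).
● With conventional networks, performance gets worse
● CIFAR-10 example (left: conventional, right: ResNet)
● With conventional networks, error increases as layers are added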

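Why Is It Hard to Go Deeper?
● Vanishing gradients
○ Cause
■ During backpropagation, the gradient is multiplied by small weights over and over [1]
○ Mitigations
■ Careful initialization [2]
■ Hidden layer supervision [3]
■ Batch normalization [4]
■ ResNet's identity mapping (described later)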
[1]. Huang, Gao, et al. "Deep networks with stochastic depth." arXiv preprint arXiv:1603.09382 (2016).
[2]. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward
neural networks. In: International conference on artificial intelligence and statistics. (2010) 249–256
[3] Lee, C.Y., Xie, S., Gallagher, P., Zhang, Z., Tu, Z.: Deeply-supervised nets. arXiv preprint arXiv:1409.5185 (2014)
[4] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training
by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
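
As a rough illustration of the cause listed above, the sketch below multiplies a unit gradient by the same small factor once per layer. The factor 0.5 and the helper name `backprop_scale` are assumptions chosen for illustration, not values or code from the cited papers.

```python
def backprop_scale(per_layer_factor, num_layers):
    """Magnitude left after scaling a unit gradient by the same factor at every layer."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= per_layer_factor
    return scale

# The deeper the network, the less gradient reaches the first layer.
for depth in (20, 56, 110):
    print(depth, backprop_scale(0.5, depth))
```

A factor of exactly 1.0 would leave the gradient untouched at any depth, which is the intuition behind the shortcut connections discussed later.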

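● Degradation of feature information
○ Cause
■ In the feed-forward pass, randomly initialized weights wash out the features, so they never reach the later layers [1]
○ Mitigation
■ ResNet's identity mapping (described later)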

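● Training takes a long time
○ Cause
■ Computation time grows with the number of layers.
■ Even ResNet needed several weeks of training for ImageNet [1].
■ On a single TITAN X with CIFAR-10: about 2 hours for 20 layers, about half a day for 110 layers
○ Mitigations
■ Money and time (ResNet) [2]
■ Randomly varying the number of layers, dropout-style [1] (described later)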
[2]. He, Kaiming, et al. "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385 (2015).

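What Is ResNet's Identity Mapping?
[Figure label: conventional network]
Create a shortcut path that adds in the output of a layer several layers back.
This shortcut addition is called the identity mapping.
Source: Deep Residual Learning MSRA @ ILSVRC & COCO 2015 competitions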

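Why Does This Solve the Problem?
When training is going well
● If x is already optimal, the weight layers can go to 0 and the shortcut alone is enough.
● If x is near optimal, the weights only need a small update.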
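
The shortcut addition and the zero-weight case above can be sketched with NumPy. This is a minimal sketch under stated assumptions: dense matrix products stand in for the convolutions, batch normalization is omitted, and the names `residual_block`, `w1`, and `w2` are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x), where F is two weight layers with a ReLU in between."""
    f = relu(x @ w1) @ w2   # residual branch F(x); dense layers stand in for convs
    return relu(f + x)      # the shortcut adds the unchanged input back

rng = np.random.default_rng(0)
x = relu(rng.normal(size=(4, 8)))     # non-negative input, as after a previous ReLU
zeros = np.zeros((8, 8))

# With all-zero weights the residual branch vanishes and the block is the identity.
y = residual_block(x, zeros, zeros)
print(np.allclose(y, x))
```

Because the block starts from the identity, it only has to learn the correction F(x) on top of x.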

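Why Does This Solve the Problem?
● Adding in the output from two layers back prevents feature information from being lost in the feed-forward pass.
● The formulation also makes gradients less likely to vanish during backpropagation.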
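
The backpropagation claim above can be made concrete with the derivation from the identity-mappings paper by He et al. cited in these slides. It assumes pure identity shortcuts with no activation after the addition, so that $x_{l+1} = x_l + F(x_l, W_l)$. The symbols $x_l$, $F$, $W_l$, and the loss $E$ follow that paper, not these slides.

```latex
\begin{align}
x_L &= x_l + \sum_{i=l}^{L-1} F(x_i, W_i) \\
\frac{\partial E}{\partial x_l}
  &= \frac{\partial E}{\partial x_L}
     \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i) \right)
\end{align}
```

The constant term 1 carries the gradient from any deeper layer $L$ straight back to layer $l$, so it does not vanish even when the weights inside $F$ are small.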

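Experiments on CIFAR-10
Left: conventional networks, right: ResNet. Bold lines: test error, dashed lines: training error
When ResNet is run on CIFAR-10, accuracy improves as the number of layers increases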

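Still, Several Questions Remain
● Model architecture
○ Is this architecture really the best? [1,2,3]
● Optimization method
○ Is SGD+Momentum the best? [3]
● Training time
○ Can it be reduced somehow? [4]
As a result, a large body of derivative research has appeared.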
[1]. He, Kaiming, et al. "Identity mappings in deep residual networks." arXiv
preprint arXiv:1603.05027 (2016).
[2]. Szegedy, Christian, Sergey Ioffe, and Vincent Vanhoucke. "Inception-v4,
inception-resnet and the impact of residual connections on learning." arXiv
preprint arXiv:1602.07261 (2016).
[3]. Training and investigating Residual Nets
[4]. Huang, Gao, et al. "Deep networks with stochastic depth." arXiv preprint
arXiv:1603.09382 (2016).

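Derivative Research: Model Architecture
● Follow-up experiments by the authors of ResNet.
● Evaluates performance for different placements of BN (Batch Norm) and ReLU
○ Reports that applying BN and ReLU before the convolution performs best
Source: He, Kaiming, et al. "Identity mappings in deep residual networks." arXiv preprint arXiv:1603.05027 (2016).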
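
The two orderings compared above can be sketched side by side. This is a minimal sketch, not the authors' implementation: dense matrix products stand in for convolutions, `batch_norm` has no learned scale or shift, and all function names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch axis (no learned scale/shift in this sketch)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def original_block(x, w1, w2):
    """weight -> BN -> ReLU -> weight -> BN -> add -> ReLU"""
    f = batch_norm(relu(batch_norm(x @ w1)) @ w2)
    return relu(f + x)

def preact_block(x, w1, w2):
    """BN -> ReLU -> weight -> BN -> ReLU -> weight -> add"""
    f = relu(batch_norm(relu(batch_norm(x)) @ w1)) @ w2
    return f + x   # nothing is applied after the addition

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
w1 = rng.normal(size=(8, 8)) * 0.1
w2 = rng.normal(size=(8, 8)) * 0.1
print(original_block(x, w1, w2).shape, preact_block(x, w1, w2).shape)
```

In the pre-activation form the shortcut path is left completely untouched, so the input passes through the block unchanged whenever the residual branch outputs zero.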

Note: NSize=18 means 110 layers; BN: Batch Norm
A report that the final ReLU is not needed in the first place
Source: Training and investigating Residual Nets
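
The relation between the block count and the layer count in the note above follows the 6n+2 rule used for the CIFAR-10 networks in the original ResNet paper. The helper name `cifar_resnet_depth` is hypothetical.

```python
def cifar_resnet_depth(n):
    """3 stages x n residual blocks x 2 layers per block, plus the first conv and the final FC layer."""
    return 3 * n * 2 + 2

for n in (3, 5, 18):
    print(n, cifar_resnet_depth(n))   # 3 -> 20, 5 -> 32, 18 -> 110
```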

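Experiment: Model Architecture
● Applied a 32-layer network to CIFAR-10
● The architecture from the original paper performed best
● With more layers, placing both BN and ReLU before the convolution may be better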

● A paper from Google
● Improves the model so that it exceeds ResNet's accuracy on ImageNet classification
● Top-5 error
○ ResNet: 3.57%
○ This paper: 3.08%
[1]. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning

Findings
● Training becomes unstable once the number of filters exceeds 1000, so the output of the Inception branch should be scaled by 0.1 to 0.3
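
The scaling described above amounts to shrinking the residual branch before it is added to the shortcut. A minimal sketch, assuming a hypothetical helper `scaled_residual_add` and a default factor of 0.1 taken from the low end of the quoted range:

```python
import numpy as np

def scaled_residual_add(x, branch_out, scale=0.1):
    """Add a down-scaled residual (Inception) branch output to the shortcut input."""
    return x + scale * branch_out

x = np.ones(3)
branch_out = np.full(3, 10.0)          # a large branch activation
print(scaled_residual_add(x, branch_out))
```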

Derivative Research: Changing the Optimizer
● Applied a 110-layer ResNet to CIFAR-10
● The setting from the paper performed best
Source: Training and investigating Residual Nets

Experiment: Changing the Optimizer
In my own runs as well, the setting from the paper performed best
(32-layer ResNet applied to CIFAR-10)
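
For reference, the SGD+Momentum setting from the paper can be written as a single update step. This is a sketch of the classical momentum rule, which is one of several common conventions. The defaults (learning rate 0.1, momentum 0.9, weight decay 1e-4) are the values reported in the original ResNet paper, while the function name and the toy loss are assumptions.

```python
def sgd_momentum_step(w, v, grad, lr=0.1, momentum=0.9, weight_decay=1e-4):
    """One classical momentum update with L2 weight decay folded into the gradient."""
    g = grad + weight_decay * w
    v = momentum * v - lr * g
    return w + v, v

# Toy problem: minimize 0.5 * w**2, whose gradient is w.
w, v = 1.0, 0.0
for _ in range(300):
    w, v = sgd_momentum_step(w, v, grad=w)
print(abs(w))
```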

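Derivative Research: Reducing Training Time
● Randomly keeping only the shortcut of a block during training reduces training time.
● Accuracy also improves over the original ResNet
Source: Huang, Gao, et al. "Deep networks with stochastic depth." arXiv preprint arXiv:1603.09382 (2016).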
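
The idea above can be sketched as follows. The linear decay of the survival probability and the test-time scaling follow the cited stochastic depth paper. The ReLU after the addition is left out for brevity, and the function names and the `branch` callable are hypothetical.

```python
import numpy as np

def survival_probs(num_blocks, p_last=0.5):
    """Linear decay: p_l = 1 - (l / L) * (1 - p_L) for l = 1..L."""
    l = np.arange(1, num_blocks + 1)
    return 1.0 - (l / num_blocks) * (1.0 - p_last)

def stochastic_block(x, branch, p_survive, training, rng):
    """Residual block whose branch is randomly dropped during training."""
    if training:
        if rng.random() < p_survive:
            return x + branch(x)        # block survives: normal residual update
        return x                        # block dropped: only the shortcut remains
    return x + p_survive * branch(x)    # test time: scale the branch by its survival probability

rng = np.random.default_rng(0)
print(survival_probs(4))
print(stochastic_block(np.ones(2), lambda t: 2.0 * t, 0.5, training=False, rng=rng))
```

Dropped blocks skip the branch computation entirely, which is where the training-time saving comes from.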

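Lessons from the Implementation
● Be careful with weight initialization.
○ Casually initializing from a Gaussian with std 0.01 does not work.
○ Initialize with std = √(2/(k*k*c)) (k = kernel size, c = number of channels) [1]
● Do not add a bias to the convolution layers.
● Do not assume Adam is always the right choice.
● For global average pooling, see [2]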
[1]. He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on imagenet
classification." Proceedings of the IEEE International Conference on Computer Vision. 2015.
[2]. Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).
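
The initialization formula and global average pooling mentioned above can be sketched in NumPy. Two assumptions are made here: c is taken to be the number of input channels, and tensors use a (batch, height, width, channels) layout. The function names are hypothetical.

```python
import numpy as np

def he_normal_conv(k, c_in, c_out, rng):
    """Sample a k x k conv kernel from N(0, std^2) with std = sqrt(2 / (k * k * c_in))."""
    std = np.sqrt(2.0 / (k * k * c_in))
    return rng.normal(0.0, std, size=(k, k, c_in, c_out))

def global_average_pool(feature_maps):
    """Average each channel over its spatial positions: (batch, H, W, C) -> (batch, C)."""
    return feature_maps.mean(axis=(1, 2))

rng = np.random.default_rng(0)
kernel = he_normal_conv(3, 16, 64, rng)
print(kernel.shape, kernel.std())
print(global_average_pool(np.ones((2, 8, 8, 64))).shape)
```

For a 3x3 kernel with 16 input channels the target std is about 0.118, an order of magnitude larger than the fixed 0.01 that the slide warns against.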

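Conclusion
● ResNet
○ Residuals make stable training possible even with more than 100 layers
● Derivative research
○ Model architecture
■ BN+ReLU before the convolution looks promising
○ Optimization method
■ SGD+Momentum is currently the best
○ Reducing training time
■ Use dropout-style layer dropping (stochastic depth).
● Repository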
○ https://github.com/namakemono/cifar10-tensorflow

References
[1]. He, Kaiming, et al. "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385 (2015).
The ResNet paper
[2]. Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015).
The Batch Norm paper
[3]. He, Kaiming, et al. "Identity mappings in deep residual networks." arXiv preprint arXiv:1603.05027 (2016).
A study of the ResNet model architecture
[4]. He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification." Proceedings of the IEEE International Conference on Computer Vision. 2015.
Describes the weight initialization method used in ResNet

References
[5]. Training and investigating Residual Nets
Performance comparison across changes to the ResNet model and optimizer
[6]. CS231n Convolutional Neural Networks for Visual Recognition
Discussion of changing the learning rate
[7]. Eldan, Ronen, and Ohad Shamir. "The Power of Depth for Feedforward Neural Networks." arXiv preprint arXiv:1512.03965 (2015).
A paper explaining that deeper performs better than wider
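[8]. Huang, Gao, et al. "Deep networks with stochastic depth." arXiv preprint arXiv:1603.09382 (2016).
Reduces training time by introducing dropout-style layer dropping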
