Several recent papers have explored self-supervised learning methods for vision transformers (ViT). Key approaches include:
1. Masked image modeling, where the model is trained to reconstruct masked patches of the input image.
2. Contrastive learning using techniques like MoCo to learn representations by contrasting augmented views of the same image.
3. Self-distillation methods like DINO that distill a teacher ViT into a student ViT using different views of the same image (a minimal loss sketch follows this list).
4. Hybrid approaches that combine masked prediction with self-distillation, such as iBOT.
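To make the self-distillation idea concrete, below is a minimal sketch of a DINO-style loss, assuming PyTorch; the temperatures, the centering term, and all variable names are illustrative assumptions, not taken from the paper's released code.

```python
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    # Teacher targets: centered and sharpened with a low temperature;
    # gradients are stopped so only the student is trained by backprop.
    targets = F.softmax((teacher_logits - center) / tau_t, dim=-1).detach()
    log_probs = F.log_softmax(student_logits / tau_s, dim=-1)
    # Cross-entropy between teacher and student output distributions,
    # averaged over the batch.
    return -(targets * log_probs).sum(dim=-1).mean()
```

In DINO the teacher is not updated by gradient descent: its weights track the student as an exponential moving average, and the loss is accumulated over pairs of augmented crops of the same image.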
This document summarizes recent research on applying self-attention mechanisms from Transformers to domains other than language, such as computer vision. It discusses models that use self-attention for images, including ViT, DeiT, and T2T, which apply Transformers to divided image patches. It also covers more general attention modules, such as the Perceiver, which aims to be domain-agnostic. Finally, it discusses work on transferring pretrained language Transformers to other modalities with frozen weights, showing that they can function as universal computation engines.
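As a concrete illustration of "applying Transformers to divided image patches", here is a minimal patch-embedding sketch, assuming PyTorch; the class name and default dimensions are placeholders rather than the exact values of any cited model.

```python
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution cuts the image into non-overlapping patches
        # and linearly projects each patch to the Transformer width.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, embed_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)
```

The resulting sequence of patch tokens (plus positional embeddings and, in ViT, a class token) is then fed to a standard Transformer encoder.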
The document discusses using the Raspberry Pi GPU for deep neural network inference on end devices. It provides an overview of the Raspberry Pi GPU architecture and benchmarks convolutional neural network models such as GoogLeNet, ResNet50, and YOLO on the Raspberry Pi 3 and Zero. Optimization techniques discussed include specialized convolution implementations, instruction golfing to reduce operations, removing wasteful computations, and improving data locality.
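As one illustration of what a specialized convolution implementation with better data locality can look like (this specific code is not from the talk), here is an im2col-style lowering in NumPy: every receptive field is copied into a contiguous row so the whole convolution becomes a single cache-friendly matrix multiply.

```python
import numpy as np

def conv2d_im2col(x, w):
    """x: (H, W, C_in); w: (K, K, C_in, C_out); stride 1, no padding."""
    H, W, C_in = x.shape
    K, _, _, C_out = w.shape
    out_h, out_w = H - K + 1, W - K + 1
    # Gather every KxK receptive field into one row: (out_h*out_w, K*K*C_in).
    cols = np.stack([x[i:i + out_h, j:j + out_w]
                     for i in range(K) for j in range(K)], axis=2)
    cols = cols.reshape(out_h * out_w, K * K * C_in)
    # One large, contiguous GEMM instead of many scattered memory accesses.
    out = cols @ w.reshape(K * K * C_in, C_out)
    return out.reshape(out_h, out_w, C_out)
```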
1. DEEP LEARNING JP [DL Papers]
“YOLO9000: Better, Faster, Stronger” (CVPR’17 Best Paper)
And the History of Object Detection
Makoto Kawano, Keio University
http://deeplearning.jp/
2. Bibliographic Information
- CVPR 2017 Best Paper Award
- Joseph Redmon, Ali Farhadi (University of Washington)
- Why I chose this paper:
  - I knew of the previous version, YOLO (by the same authors plus collaborators)
  - I had heard that the upgraded version won the Best Paper award
- Using this paper as the anchor, I will walk through a rough history of object detection
  - R-CNN (2014) through Mask R-CNN (2017)
  - R-CNN, SPPNet, Fast R-CNN, Faster R-CNN, YOLO, SSD, YOLO9000 (and just a taste of Mask R-CNN)
- This is a field I have barely touched, and I deeply regretted promising to cover it
- The talk is full of my own judgments and biases, so please point out anything I get wrong
20. YOLOv2
- Architectural improvements
  - ① Add Batch Normalization to every conv layer
    - Speeds up convergence and acts as a regularizer
  - ② Switch to a new backbone, Darknet-19
    - Uses 3×3 filters, like VGG16
    - Uses Global Average Pooling, as in Network In Network
  - ③ Add a passthrough layer (this part is unclear to me; see the sketch after this slide)
    - "add a passthrough layer from the final 3 × 3 × 512 layer to the second to last convolutional layer"
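Since the slide flags the passthrough layer as unclear, here is a minimal sketch of the reorganization it performs (a space-to-depth reshape), assuming PyTorch; the 26×26×512 → 13×13×2048 shapes follow the YOLO9000 paper, while the function name is a placeholder.

```python
import torch

def passthrough(x, stride=2):
    """Reshape (B, C, H, W) -> (B, C*stride**2, H/stride, W/stride)."""
    B, C, H, W = x.shape
    x = x.view(B, C, H // stride, stride, W // stride, stride)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(B, C * stride * stride, H // stride, W // stride)

fine = torch.randn(1, 512, 26, 26)   # earlier, fine-grained feature map
print(passthrough(fine).shape)       # torch.Size([1, 2048, 13, 13])
# The reorganized fine-grained features are concatenated channel-wise with
# the coarse 13x13 map before the final detection layers.
```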