An Introduction to
"Online Real-time Multiple Spatiotemporal Action Localisation and Prediction"
November 14, 2017
西村仁志
Paper introduced
ICCV 2017
"Online Real-time Multiple Spatiotemporal Action Localisation and Prediction"
Gurkirt Singh1, Suman Saha1, Michael Sapienza2, Philip Torr2, Fabio Cuzzolin1
1Oxford Brookes University, 2University of Oxford
https://arxiv.org/pdf/1611.08563.pdf
Note: the source code is reportedly "to be released"
Impressions
• I hadn't managed this lately, but this time I read the paper closely. Skim-reading alone tends to gloss over the technical core, so I'd like to do one close reading per week
• An important paper from the standpoint of real-world applications such as robotics and autonomous driving; especially instructive for companies like ours
• No complex theory or equations; a simple method
• Oxford's papers are consistently clear and readable (not brute-force deep learning; rich in analysis as well)
• This method (like most recent techniques in our field) is a combination of several building blocks: low-level features (RGB, flow), classification, localisation (spatial/temporal), tube generation (tracking), and fusion
1. Introduction
• Background:
1. Perform spatio-temporal (S/T) action localisation and classification in real time
2. Predict action tubes online, frame by frame
• Problem: existing methods are computationally expensive, and their detection accuracy is still below what is needed for real-world deployment
• Proposal: an online framework that resolves the above
2. Related work
3. Methodology
3.1. Optical flow computation
• Real-time optical flow (Fig. 2b) [16]
• As an option, one can compute optical flow more accurately (Fig. 2c), using Brox et al.'s [1] method
• Transfer learning: first train the SSD network on accurate flow results, then transfer the learned weights to initialise those of the real-time OF network (a sketch of how a flow field can be fed to the detector follows below)
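The slide does not spell out how a flow field is fed to an appearance-style detector such as SSD. As a hedged illustration (my own assumption, using the common x/y/magnitude encoding from the two-stream literature, not necessarily the paper's exact scheme), here is a minimal, self-contained sketch that turns a 2-channel flow field into a 3-channel image:

```python
import numpy as np

def flow_to_image(flow: np.ndarray, bound: float = 20.0) -> np.ndarray:
    """Encode an HxWx2 flow field (dx, dy) as an HxWx3 uint8 image."""
    dx, dy = flow[..., 0], flow[..., 1]
    mag = np.sqrt(dx ** 2 + dy ** 2)
    # Clip each channel to a fixed range and rescale to [0, 255].
    def scale(ch, lo, hi):
        return np.uint8(255 * (np.clip(ch, lo, hi) - lo) / (hi - lo))
    return np.stack([scale(dx, -bound, bound),
                     scale(dy, -bound, bound),
                     scale(mag, 0.0, bound)], axis=-1)
```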
3.2. Integrated detection network
• We use a single-stage convolutional neural network (Fig. 2e) for bounding box prediction and classification, which follows an end-to-end trainable architecture proposed in [22].
• The architecture unifies in a single CNN a number of functionalities which, in other action and object detectors, are performed by separate components [7, 53, 30, 33] (a sketch of how the resulting per-class detections are pruned follows below):
1. region proposal generation
2. bounding box prediction
3. estimation of class-specific confidence scores for the predicted boxes
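The slide omits the post-processing of the per-frame detections. As a hedged sketch, below is standard per-class non-maximum suppression, which single-stage detectors of this family (SSD included) apply to the predicted boxes and class-specific scores; the threshold value is my own illustrative choice:

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thr: float = 0.45):
    """boxes: Nx4 (x1, y1, x2, y2); scores: N. Returns indices of kept boxes."""
    order = scores.argsort()[::-1]          # process highest-scoring boxes first
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection-over-union of box i with all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thr]        # drop boxes overlapping the kept one
    return keep
```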
3.3. Fusion of appearance and flow cues
Two strategies for fusing the appearance (RGB) and flow detections (see the sketch below):
1. Boost-fusion
2. Fusion by taking the union-set
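A minimal sketch of the simpler union-set strategy: pool the appearance-stream and flow-stream detections for a frame into one set, then prune duplicates. This reuses the nms() helper from the Sec. 3.2 sketch; the boost-fusion variant re-weights scores across streams instead and is omitted here. The function below is my own illustration, not the paper's implementation:

```python
import numpy as np

def union_set_fusion(boxes_rgb, scores_rgb, boxes_flow, scores_flow, iou_thr=0.45):
    """Fuse one frame's detections from the RGB and flow streams by union."""
    boxes = np.concatenate([boxes_rgb, boxes_flow], axis=0)
    scores = np.concatenate([scores_rgb, scores_flow], axis=0)
    keep = nms(boxes, scores, iou_thr)   # nms() as defined in the Sec. 3.2 sketch
    return boxes[keep], scores[keep]
```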
3.4. Online action tube generation
An action tube is a sequence of bounding boxes with an action class and a start/end time.
Require:
1. consecutive detections that are part of an action tube to have spatial overlap above a threshold
2. each class-specific detection to belong to a single action tube
3. the online update of the tubes' temporal label
• We propose a simple but efficient online action tube generation algorithm
• It incrementally (frame by frame) builds multiple action tubes for each action class in parallel
3.4.1 A novel greedy algorithm
(Reconstructed from the slide's diagram; a code sketch follows.)
• At frame t, the tube list from frame t-1 is sorted by score, a tube's score being the average of the detection-box scores inside the tube
• The frame-t detections form a potential-match list; each tube, highest-scoring first, is greedily assigned the highest-scoring box that overlaps it
• A tube with no matching box is deleted
• Matched tubes have their labels updated
• Left-over boxes each start a new tube
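A compact sketch of the greedy matching just described, run per action class. The data layout (a tube as a list of (box, score) pairs) and the IoU threshold value are my own illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def update_tubes(tubes, detections, iou_thr=0.3):
    """tubes: list of tubes, each a list of (box, score) pairs;
    detections: list of (box, score) for the current frame (one class)."""
    # Process tubes highest-scoring first (score = mean box score inside the tube).
    tubes.sort(key=lambda tb: np.mean([s for _, s in tb]), reverse=True)
    unmatched = list(detections)
    survivors = []
    for tube in tubes:
        last_box = tube[-1][0]
        # Candidate boxes overlapping the tube's last box above the threshold.
        cands = [d for d in unmatched if iou(last_box, d[0]) >= iou_thr]
        if not cands:
            continue                           # no matching box: the tube is deleted
        best = max(cands, key=lambda d: d[1])  # greedily take the highest-scoring box
        tube.append(best)
        unmatched.remove(best)
        survivors.append(tube)
    # Each left-over detection seeds a new tube.
    survivors.extend([[d] for d in unmatched])
    return survivors
```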
3.4.2 Temporal labelling
A tube carries a sequence of frame-level labels L = {l_1, l_2, …, l_T}, with l_t ∈ {c, 0}, where c is the tube's class label and 0 denotes the background class.
• Identical consecutive labels (action or background) are encouraged to stay adjacent
• Higher-scoring labels (action or background) are encouraged to survive
(a dynamic-programming sketch of this trade-off follows below)
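The two bullets above describe a trade-off that can be solved as a Viterbi-style dynamic program: choose l_1…l_T in {c, 0} to maximise the per-frame label scores while paying a penalty for each label switch. The sketch below is my own rendering of that idea; the score layout and the penalty value alpha are assumptions, not the paper's exact energy:

```python
import numpy as np

def label_tube(scores: np.ndarray, alpha: float = 3.0) -> list:
    """scores: Tx2 array; column 0 = background score, column 1 = action score."""
    T = scores.shape[0]
    best = scores[0].astype(float).copy()   # best[k] = best path score ending in label k
    back = np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        new_best = np.empty(2)
        for k in (0, 1):
            stay, switch = best[k], best[1 - k] - alpha   # switching labels costs alpha
            back[t, k] = k if stay >= switch else 1 - k
            new_best[k] = max(stay, switch) + scores[t, k]
        best = new_best
    labels = [int(np.argmax(best))]          # pick the best final label...
    for t in range(T - 1, 0, -1):            # ...and backtrack through the table
        labels.append(int(back[t, labels[-1]]))
    return labels[::-1]                      # 0 = background, 1 = the tube's class c
```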
4. Experiments
• Tests
1. Early action prediction (§4.1)
2. Online spatio-temporal action localisation
• Datasets
• UCF-101-24: although each video contains only a single action category, it may contain multiple action instances (up to 12 in a video) of the same action class, with different spatial and temporal boundaries.
• J-HMDB-21 [12] is a subset of the HMDB-51 dataset [17] with 21 action categories and 928 videos, each containing a single action instance and trimmed to the action's duration.
• Evaluation metrics (a sketch of the tube overlap behind them follows below)
• AUC (area under the curve)
• mAP (mean average precision)
• Video Observation Percentage (VOP): the portion (%) of the entire video observed before predicting the action label and location
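Both AUC and mAP here rest on an overlap measure between a predicted and a ground-truth tube. The slide does not define it; as an assumption, below is the spatio-temporal IoU that is standard in this literature (mean per-frame box IoU over the temporal union of the two tubes), reusing iou() from the Sec. 3.4.1 sketch:

```python
def st_iou(tube_a: dict, tube_b: dict) -> float:
    """Each tube maps frame index -> (x1, y1, x2, y2) box."""
    frames = set(tube_a) | set(tube_b)      # temporal union of the two tubes
    overlaps = [iou(tube_a[f], tube_b[f]) if f in tube_a and f in tube_b else 0.0
                for f in frames]            # frames covered by only one tube score 0
    return sum(overlaps) / len(frames)
```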
4.1. Early action label prediction
• The proposed method achieves high accuracy (especially at low VOP)
• UCF > JHMDB is due to the amount of training data
• RAF ≈ AF → optical flow is not important for "classification"
4.2. Online spatiotemporal action localisation
4.2.1 Performance over time (vs. online methods)
• Fig. 4: Soomro et al. degrades in accuracy after a warm-up period, whereas the proposed method stays stable
• Fig. 5: the decline on UCF101 at δ=0.5 is because the videos are temporally untrimmed and contain multiple action instances
4.2. Online spatiotemporal action localisation
4.2.2 Global performance
• At δ=0.5 (a stricter, more realistic condition), the proposed method is particularly effective
• Flow is effective (especially on JHMDB-21)
• boost-fusion < union-set
• It even beats the offline methods → the offline methods have room for improvement
4.4. Test time detection speed
• Intel Xeon CPU @ 2.80 GHz (8 cores)
• Two NVIDIA Titan X GPUs
• For action tube generation, we ran 8 CPU threads in parallel, one for each class
• Our framework is able to detect multiple co-occurring action instances in real time, while retaining very competitive performance.
5. Conclusions and future plans
• Conclusions
• We presented a novel online framework for action localisation and prediction able to address the challenges involved in concurrent multiple human action recognition, spatial localisation and temporal detection, in real time.
• Aimed at real-time applications such as autonomous driving, human-robot interaction and surgical robotics
• Future plans
• Motion vectors [60] → faster detection speeds
• A faster frame-level detector, such as YOLO [29]
• More sophisticated online tracking algorithms [54] for tube generation
