4. 1. Introduction
• Background:
1. Perform spatio-temporal (S/T) action localisation and classification in real time
2. Predict action tubes online, frame by frame
• Problem: existing methods are computationally expensive, and their detection accuracy is still below what is needed for real-world deployment
• Proposal: an online framework that resolves both issues
7. 3.1. Optical flow computation
• Real-time optical flow (Fig. 2b) [16]
• As an option, optical flow can be computed more accurately (Fig. 2c) using Brox et al.'s [1] method
• Transfer learning: first train the SSD network on the accurate flow results, then transfer the learned weights to initialise those of the real-time OF network
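The flow stream feeds optical-flow fields to the SSD network as images. A minimal numpy sketch of one common encoding (x-displacement, y-displacement and magnitude stacked into three channels, rescaled to [0, 255]); the exact encoding used in the paper is not specified in these notes, so treat this as an assumption:

```python
import numpy as np

def flow_to_image(flow):
    """Encode an (H, W, 2) optical-flow field as a 3-channel uint8 image.

    Channels: x-displacement, y-displacement, magnitude, each rescaled
    independently to [0, 255]. One common convention for feeding flow
    to a detection CNN; the paper's exact encoding may differ.
    """
    fx, fy = flow[..., 0], flow[..., 1]
    mag = np.sqrt(fx ** 2 + fy ** 2)
    img = np.stack([fx, fy, mag], axis=-1)
    # Rescale each channel independently to the full uint8 range.
    lo = img.min(axis=(0, 1), keepdims=True)
    hi = img.max(axis=(0, 1), keepdims=True)
    img = (img - lo) / np.maximum(hi - lo, 1e-8) * 255.0
    return img.astype(np.uint8)
```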
8. 3.2. Integrated detection network
• We use a single-stage convolutional neural network (Fig. 2e) for bounding box prediction and classification, which follows the end-to-end trainable architecture proposed in [22].
• The architecture unifies, in a single CNN, a number of functionalities that other action and object detectors perform with separate components [7, 53, 30, 33]:
1. region proposal generation
2. bounding box prediction
3. estimation of class-specific confidence scores for the
predicted boxes
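The second and third of these functionalities come out of the network as, per anchor (prior) box, regression offsets plus class confidences. A sketch of the standard SSD-style box decoding; the centre-offset parameterisation and the variance values (0.1, 0.2) are assumptions borrowed from common SSD implementations, not stated in these notes:

```python
import numpy as np

def decode_boxes(anchors, offsets, variances=(0.1, 0.2)):
    """Decode SSD-style regression offsets into boxes.

    anchors: (N, 4) as (cx, cy, w, h); offsets: (N, 4) predicted
    (dcx, dcy, dw, dh). Returns (N, 4) boxes as (x1, y1, x2, y2).
    Variance values follow common SSD implementations (an assumption).
    """
    cx = anchors[:, 0] + offsets[:, 0] * variances[0] * anchors[:, 2]
    cy = anchors[:, 1] + offsets[:, 1] * variances[0] * anchors[:, 3]
    w = anchors[:, 2] * np.exp(offsets[:, 2] * variances[1])
    h = anchors[:, 3] * np.exp(offsets[:, 3] * variances[1])
    # Convert centre form back to corner form.
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)
```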
9. 3.3. Fusion of appearance and flow cues
1. Boost-fusion
2. Fusion by taking the union-set
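A sketch of the two strategies as these notes describe them: boost-fusion raises an appearance box's score when an overlapping flow box agrees, while union-set fusion simply pools both detection sets. The IoU threshold (0.3) and the boosting formula (add the overlap-weighted flow score) are assumptions, not the paper's exact formulation:

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) form."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def boost_fusion(app_dets, flow_dets, thresh=0.3):
    """Boost each appearance detection's score using the best-matching
    flow detection. Detections are (box, score) pairs for one class.
    Boosting formula here is an assumption, not the paper's."""
    fused = []
    for box, score in app_dets:
        best = max((iou(box, fb) * fs for fb, fs in flow_dets
                    if iou(box, fb) >= thresh), default=0.0)
        fused.append((box, score + best))
    return fused

def union_fusion(app_dets, flow_dets):
    """Union-set fusion: pool both detection sets (non-maximum
    suppression would normally follow to remove duplicates)."""
    return list(app_dets) + list(flow_dets)
```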
10. 3.4. Online action tube generation
(Figure: an action tube consists of an action class, a start/end time, and per-frame bounding boxes)
Requirements:
1. consecutive detections that are part of an action tube must have spatial overlap above a threshold
2. each class-specific detection may belong to only a single action tube
3. the tubes' temporal labels must be updated online
• We propose a simple but efficient online action tube generation algorithm
• It incrementally (frame by frame) builds multiple action tubes for each action class in parallel
11. 3.4.1 A novel greedy algorithm
(Figure: greedy matching between the tube list at time t-1 and the potential-match detection list at time t)
• Tubes are ranked by the mean score of the detection boxes they contain
• Each tube, in rank order, is matched to the highest-scoring detection box that overlaps it at time t
• If no box matches a tube, the tube is terminated
• Matched tubes have their labels updated
• Any remaining boxes start new tubes
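The greedy matching described on this slide can be sketched for a single class as follows; the IoU threshold and the tube data layout are assumptions for illustration, not the paper's exact parameters:

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) form."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def update_tubes(tubes, detections, iou_thresh=0.1):
    """One frame of greedy online tube building for a single class.

    tubes: list of dicts {"boxes": [...], "scores": [...], "alive": bool}
    detections: list of (box, score) for the current frame.
    Tubes with a higher mean detection score pick first; each detection
    joins at most one tube; leftover detections seed new tubes.
    """
    unmatched = list(detections)
    # Higher-scoring tubes (mean of their detection scores) choose first.
    order = sorted(range(len(tubes)),
                   key=lambda i: -sum(tubes[i]["scores"]) / len(tubes[i]["scores"]))
    for i in order:
        tube = tubes[i]
        if not tube["alive"]:
            continue
        # Candidate detections overlapping the tube's last box.
        cands = [(j, s) for j, (b, s) in enumerate(unmatched)
                 if iou(tube["boxes"][-1], b) >= iou_thresh]
        if not cands:
            tube["alive"] = False          # no match: terminate the tube
            continue
        j, _ = max(cands, key=lambda x: x[1])   # highest-scoring match
        box, score = unmatched.pop(j)
        tube["boxes"].append(box)
        tube["scores"].append(score)
    # Remaining detections start new tubes.
    for box, score in unmatched:
        tubes.append({"boxes": [box], "scores": [score], "alive": True})
    return tubes
```

Running this once per frame and per class gives the incremental, parallel tube construction described above.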
12. 3.4.2 Temporal labelling
• A tube consists of a sequence of detections {b_1, b_2, …, b_T}
• Each detection b_t receives a label y_t ∈ {c, 0}, where c is the tube's class label and 0 denotes the background class
• The labelling encourages adjacent frames to share the same label (action or background)
• Labels (action or background) with higher scores are more likely to be retained
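A labelling of this form can be found exactly by dynamic programming (Viterbi): each frame earns the score of its chosen label, and a switching penalty encourages runs of the same label. A sketch assuming per-frame action/background scores and a constant penalty alpha; the exact form of the paper's smoothness term is an assumption here:

```python
def temporal_label(action_scores, bg_scores, alpha=1.0):
    """Label each frame of a tube as action (1) or background (0) by
    maximising the per-frame scores minus `alpha` per label switch,
    via the Viterbi algorithm."""
    T = len(action_scores)
    # best[t][y] = best total score for frames 0..t ending with label y
    best = [[bg_scores[0], action_scores[0]]]
    back = []
    for t in range(1, T):
        row, ptr = [], []
        for y, s in enumerate((bg_scores[t], action_scores[t])):
            stay = best[-1][y]                 # keep the same label
            switch = best[-1][1 - y] - alpha   # pay to change label
            if stay >= switch:
                row.append(stay + s); ptr.append(y)
            else:
                row.append(switch + s); ptr.append(1 - y)
        best.append(row); back.append(ptr)
    # Backtrack the optimal labelling.
    y = 0 if best[-1][0] >= best[-1][1] else 1
    labels = [y]
    for ptr in reversed(back):
        y = ptr[y]
        labels.append(y)
    return labels[::-1]
```

With a larger alpha, short score dips are smoothed over, which is exactly the "adjacent frames prefer the same label" behaviour noted above.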
13. 4. Experiments
• Tests
1. Early action prediction (§4.1)
2. Online spatio-temporal action localisation (§4.2)
• Datasets
• UCF-101-24: although each video contains only a single action category, it may contain multiple action instances (up to 12 in a video) of that class, with different spatial and temporal boundaries.
• J-HMDB-21 [12] is a subset of the HMDB-51 dataset [17] with 21 action categories and 928 videos, each containing a single action instance and trimmed to the action's duration.
• Evaluation metrics
• AUC (area under the curve)
• mAP (mean average precision)
• Video observation percentage: the portion (%) of the entire video observed before the action label and location are predicted
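Tube-level metrics such as AUC and mAP require a spatio-temporal overlap measure between a predicted and a ground-truth tube. A sketch of one common definition (an assumption here): average the per-frame IoU over the temporal union of the two tubes, with frames present in only one tube contributing zero:

```python
def frame_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) form."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def st_iou(tube_a, tube_b):
    """Spatio-temporal IoU of two tubes given as {frame: box} dicts:
    mean per-frame IoU over the temporal union of the two tubes
    (frames present in only one tube contribute 0)."""
    frames = set(tube_a) | set(tube_b)
    total = sum(frame_iou(tube_a[f], tube_b[f])
                for f in set(tube_a) & set(tube_b))
    return total / len(frames) if frames else 0.0
```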
14. 4.1. Early action label prediction
• The proposed method achieves high accuracy, especially when the video observation percentage is low
• The gap UCF > J-HMDB is due to the difference in the amount of training data
• RAF ≈ AF → optical flow accuracy matters little for "classification"
15. 4.2. Online spatiotemporal action localisation
4.2.1 Performance over time
• Fig. 4: Soomro et al.'s accuracy declines after a warm-up period, whereas the proposed method remains stable
• Fig. 5: the drop on UCF-101 at δ = 0.5 occurs because the videos are temporally untrimmed and contain multiple action instances
• Comparison with online methods
17. 4.4. Test time detection speed
• Intel Xeon CPU @ 2.80 GHz (8 cores)
• Two NVIDIA Titan X GPUs
• For action tube generation, we ran 8 CPU threads in parallel, one per class
• Our framework is able to detect multiple co-occurring action instances in real time, while retaining very competitive performance.
18. 5. Conclusions and future plans
• Conclusions
• We presented a novel online framework for action localisation and prediction, able to address, in real time, the challenges of concurrent multiple human action recognition, spatial localisation and temporal detection.
• This paves the way for real-time applications such as autonomous driving, human-robot interaction and surgical robotics.
• Future plans
• Motion vectors [60] → faster detection speeds
• A faster frame-level detector, such as YOLO [29]
• More sophisticated online tracking algorithms [54] for tube generation