2. Paper
- Paper: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields
  - https://arxiv.org/abs/1611.08050
- Authors: Zhe Cao, Tomas Simon, Shih-En Wei, Yaser Sheikh
  - The Robotics Institute, Carnegie Mellon University
- 24 Nov 2016
- CVPR 2017 Oral
- Slides
- Video
- Unless otherwise noted, figures are taken from the paper, slides, and video.
4. Abstract
- Proposes a method to efficiently detect the 2D poses of multiple people in an image.
- Contributions
  - Part Affinity Fields (PAFs): a representation that learns the association between body parts and the individuals they belong to.
  - A bottom-up approach that encodes global context over the whole image, achieving high accuracy in realtime regardless of the number of people.
  - Sequential Prediction with Learned Spatial Context: a CNN prediction architecture that refines its estimates iteratively.
  - Jointly Learning Parts Detection and Parts Association: part locations and their associations are learned jointly.
- Results
  - 1st place in the COCO 2016 keypoints challenge, and exceeds the previous SotA on the MPII Multi-Person benchmark in both accuracy and efficiency.
5. Introduction
- In prior work, inferring the poses of multiple people, especially in crowded scenes, is known to be a hard problem.
  - Each image may contain an unknown number of people, at any position or scale.
  - Interactions between people (contact, occlusion) make associating parts with individuals difficult.
  - Runtime complexity tends to grow with the number of people.
- Top-down approaches first run a person detector, then estimate a single-person pose within each detection.
  - If the person detector fails early, there is no way to recover.
  - Runtime grows in proportion to the number of people.
- Bottom-up approaches are attractive in principle.
  - They are robust to the early-commitment problem described above.
  - They have the potential to decouple runtime from the number of people.
  - However, prior bottom-up work did not directly use global contextual cues from other body parts and people; associations were computed by combining part-location candidates, which is costly.
- This paper presents a bottom-up approach that achieves SotA accuracy for multi-person pose estimation.
  - Part Affinity Fields (PAFs): 2D vector fields that encode the location and orientation of limbs, i.e., the association between body parts, over the image.
  - Jointly inferring part detections and their associations with a bottom-up representation encodes global context, so a greedy parse can reach high quality at a fraction of the computational cost, realizing high accuracy and realtime speed.
6. Method
- A feed-forward network predicts confidence maps S of body part locations (b) and affinity fields L, vector fields encoding the degree of association between parts (c).
  - S = (S1, S2, ..., SJ) consists of J confidence maps, one per body part.
    - Sj ∈ R^(w×h), j ∈ {1...J}
  - L = (L1, L2, ..., LC) consists of C vector fields, one per limb (pair of parts).
    - Lc ∈ R^(w×h×2), c ∈ {1...C}
- The confidence maps and affinity fields are parsed by bipartite matching (weighted matching on bipartite graphs) to output the 2D keypoints of all people (d).
9. Method > Confidence Maps for Part Detection
- To evaluate the loss f of Eq. 5 during training, groundtruth confidence maps S* are generated from the annotated 2D keypoints.
- When multiple people are present, each visible part j of each person k should produce a peak in the corresponding confidence map.
- First, individual confidence maps S*_{j,k} are generated for each person k.
  - Let x_{j,k} ∈ R² be the groundtruth position of body part j of person k. The value at location p ∈ R² in S*_{j,k} is defined as
    S*_{j,k}(p) = exp(−‖p − x_{j,k}‖²₂ / σ²)
  - σ: controls the spread of the peak.
- To match the prediction target of the network, the groundtruth confidence map for part j aggregates the individual maps via a max operator:
    S*_j(p) = max_k S*_{j,k}(p)
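The Gaussian peaks and max aggregation above can be sketched in a few lines. This is a minimal illustration, not the paper's code; `part_confidence_map` and its arguments are hypothetical names.

```python
import numpy as np

def part_confidence_map(keypoints, w, h, sigma=1.0):
    """Ground-truth confidence map S*_j for one body part j.

    keypoints: list of (x, y) positions of part j, one per person k.
    Returns an (h, w) map: per-person Gaussian peaks aggregated with max.
    """
    ys, xs = np.mgrid[0:h, 0:w]                 # pixel grid p
    maps = []
    for (x, y) in keypoints:
        d2 = (xs - x) ** 2 + (ys - y) ** 2      # ||p - x_{j,k}||^2
        maps.append(np.exp(-d2 / sigma ** 2))   # S*_{j,k}(p)
    return np.max(maps, axis=0)                 # max, not average
```

With two people at (2, 2) and (7, 7), both peaks keep their full height of 1.0, which is exactly why the max is used instead of the average.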
10. Method > Confidence Maps for Part Detection
- Taking the maximum of the confidence maps rather than the average keeps nearby peaks distinct, so precision is preserved.
- At test time, the network predicts confidence maps, and body part candidate locations are obtained by non-maximum suppression.
  - Note: if the per-person maps were averaged, overlapping peaks would blur into a flattened distribution; the max keeps a sharp peak for each person in the confidence distribution.
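The non-maximum suppression step can be sketched as a local-maximum scan over the predicted map. A simple 8-neighbour check on interior pixels is assumed here; `nms_peaks` and its threshold are illustrative, not from the paper.

```python
import numpy as np

def nms_peaks(conf, thresh=0.1):
    """Body-part candidates: local maxima of a confidence map.

    An interior pixel is kept if it exceeds `thresh` and is not smaller
    than any of its 8 neighbours. Returns a list of (x, y, score).
    """
    h, w = conf.shape
    peaks = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            v = conf[y, x]
            if v > thresh and v >= conf[y-1:y+2, x-1:x+2].max():
                peaks.append((x, y, v))
    return peaks
```

A map with two well-separated peaks yields exactly two candidates; pixels on the flank of a peak are suppressed by their stronger neighbour.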
11. Method > Part Affinity Fields for Part Association
- Given the detected body parts, how do we connect the parts belonging to the same person, for an unknown number of people? (a)
  - Simply taking the midpoint of each candidate pair produces false associations when people crowd together (b).
    - A representation that encodes only position, and not orientation, is fundamentally limited.
- Part Affinity Fields preserve both the location and the orientation of limbs (c).
  - A 2D vector field is defined for each limb (connection between two parts).
  - Each limb type has an affinity field joining the two body parts it connects.
12. Method > Part Affinity Fields for Part Association
- How the vector values are determined:
  - Let x_{j1,k}, x_{j2,k} be the positions of the parts j1, j2 forming limb c of person k.
  - If a point p lies on limb c, L*_{c,k}(p) is the unit vector pointing from j1 toward j2; at all other points the vector is zero:
    v = (x_{j2,k} − x_{j1,k}) / ‖x_{j2,k} − x_{j1,k}‖₂  (the unit vector along the limb)
  - Whether p lies on limb c is decided by the thresholds
    0 ≤ v · (p − x_{j1,k}) ≤ l_{c,k}  and  |v⊥ · (p − x_{j1,k})| ≤ σ_l
  - σ_l is the limb width in pixels, and l_{c,k} = ‖x_{j2,k} − x_{j1,k}‖₂ is the Euclidean distance between the two parts.
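The per-person PAF construction above (unit vector inside the limb rectangle, zero outside) can be sketched as follows; `limb_paf` is an illustrative name, and the rectangle test mirrors the two threshold inequalities.

```python
import numpy as np

def limb_paf(x_j1, x_j2, w, h, sigma_l=1.0):
    """Ground-truth PAF L*_{c,k} for one limb of one person.

    Points inside the limb's rectangle get the unit vector v from
    j1 to j2; all other points get the zero vector.
    Returns an (h, w, 2) field indexed [y, x].
    """
    x_j1, x_j2 = np.asarray(x_j1, float), np.asarray(x_j2, float)
    l = np.linalg.norm(x_j2 - x_j1)          # limb length l_{c,k}
    v = (x_j2 - x_j1) / l                    # unit vector along the limb
    v_perp = np.array([-v[1], v[0]])         # perpendicular unit vector
    ys, xs = np.mgrid[0:h, 0:w]
    d = np.stack([xs - x_j1[0], ys - x_j1[1]], axis=-1)  # p - x_{j1,k}
    along = d @ v                            # 0 <= v . (p - x_j1) <= l
    across = np.abs(d @ v_perp)              # |v_perp . (p - x_j1)| <= sigma_l
    on_limb = (along >= 0) & (along <= l) & (across <= sigma_l)
    return on_limb[..., None] * v            # v on the limb, 0 elsewhere
```

For a horizontal limb from (1, 2) to (5, 2), points between the two endpoints carry the vector (1, 0), while points beyond the endpoints or farther than σ_l from the limb axis are zero.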
13. Method > Part Affinity Fields for Part Association
- To match the prediction target of the network, the groundtruth PAF for limb c averages the fields of all people:
    L*_c(p) = (1 / n_c(p)) Σ_k L*_{c,k}(p)
  - n_c(p): the number of non-zero vectors at point p across the k people (so the average is taken where limbs overlap).
- At test time, the association between candidate part detections is measured by integrating the predicted PAF along the line segment connecting the candidate part locations (i.e., measuring how well the predicted PAF aligns with the candidate limb that would connect the detected parts).
  - Specifically, for a pair of candidate part locations d_{j1} and d_{j2}, the predicted PAF L_c is sampled along the segment to measure the confidence in their association:
    E = ∫₀¹ L_c(p(u)) · (d_{j2} − d_{j1}) / ‖d_{j2} − d_{j1}‖₂ du
  - p(u) interpolates between the two part locations: p(u) = (1 − u) d_{j1} + u d_{j2}.
  - In practice, the integral is approximated by sampling and summing over uniformly spaced values of u.
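The sampled approximation of the line integral E can be sketched as below. Nearest-pixel sampling and 10 sample points are assumptions for illustration; `association_score` is a hypothetical name.

```python
import numpy as np

def association_score(paf, d1, d2, n_samples=10):
    """Approximate E = ∫ L_c(p(u)) · (d2-d1)/||d2-d1|| du by sampling.

    paf: (h, w, 2) predicted field indexed [y, x];
    d1, d2: candidate part positions (x, y).
    Samples uniformly spaced points p(u) = (1-u) d1 + u d2 and averages
    the dot product of the field with the limb's unit direction.
    """
    d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
    v = (d2 - d1) / np.linalg.norm(d2 - d1)
    score = 0.0
    for u in np.linspace(0.0, 1.0, n_samples):
        p = (1 - u) * d1 + u * d2
        x, y = int(round(p[0])), int(round(p[1]))
        score += paf[y, x] @ v
    return score / n_samples
```

If the field everywhere points along (1, 0), a horizontal candidate limb scores 1.0 (perfect alignment) while a vertical one scores 0.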
14. Method > Multi-Person Parsing using PAFs
- Part detection yields many candidate parts (due to multiple people and false positives), so there are very many possible limb pairings.
- Each candidate limb is scored by the line integral over the PAFs. Finding the optimal combination of connections is a K-dimensional matching problem, which is NP-hard.
  - A greedy relaxation nevertheless produces consistently high-quality matches.
  - The likely reason is that the PAF network has a large receptive field, so the pair-wise association scores implicitly encode global context. (Discussed later.)
- First, the set of detection candidates for all body parts of multiple people is obtained:
    D_J = {d_j^m : for j ∈ {1...J}, m ∈ {1...N_j}}
  - N_j: the number of candidates for part j.
  - d_j^m ∈ R²: the location of the m-th detection candidate of body part j.
- A variable z_{j1j2}^{mn} ∈ {0,1} indicates whether the part candidates d_{j1}^m and d_{j2}^n are connected.
- The goal is to find the optimal assignment among all possible connections.
  - Z denotes the set of all possible connection variables z_{j1j2}^{mn}.
15. Method > Multi-Person Parsing using PAFs
- Consider a single limb type c connecting the pair of parts j1 and j2.
  - Finding the optimal association is a maximum weight bipartite graph matching problem: choose the matching Z_c that maximizes the total edge weight, with E_mn weighted by Eq. 10.
  - E_c is the overall weight of the matching for limb type c, Z_c is the subset of Z for limb type c, and E_mn is the part affinity between candidates d_{j1}^m and d_{j2}^n (defined by the line integral of Eq. 10).
  - The constraints (Eqs. 13, 14) ensure that no two edges share a node, i.e., no two limbs of the same type share a body part. The Hungarian algorithm is used to obtain the optimal matching.
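A per-limb matching step under the shared-part constraints can be sketched without any solver library. The slide's optimal version uses the Hungarian algorithm; the sort-by-score greedy matcher below is a simplification in the spirit of the paper's greedy relaxation, and `match_limb` is a hypothetical name.

```python
def match_limb(scores):
    """Greedy bipartite matching for one limb type c.

    scores: dict {(m, n): E_mn} of PAF association scores between
    candidates of part j1 (index m) and part j2 (index n).
    Connections are accepted best-first; each candidate is used at
    most once, enforcing the constraints of Eqs. 13 and 14.
    (The paper solves this optimally with the Hungarian algorithm;
    greedy sorting is a simple approximation.)
    """
    used_m, used_n, connections = set(), set(), []
    for (m, n), e in sorted(scores.items(), key=lambda kv: -kv[1]):
        if m not in used_m and n not in used_n:
            connections.append((m, n, e))
            used_m.add(m)
            used_n.add(n)
    return connections
```

With two candidates on each side, the two strongest non-conflicting edges are kept and the weaker cross edges are rejected because their endpoints are already taken.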
18. Results
- Evaluated on two benchmarks for multi-person pose estimation:
  - (1) MPII human multi-person dataset: 25k images, 40k people, 410 human activities
  - (2) the COCO 2016 keypoints challenge dataset
- Both datasets contain images of people in diverse real-world situations.
- Achieves SotA on each benchmark.
- Also includes an analysis of runtime performance and efficiency (Fig. 10).
19. Results > Results on the MPII Multi-Person Dataset
- Comparison with prior methods uses the PCKh threshold and mean Average Precision (mAP) over all body parts as the metric.
- Inference/optimization time per image is also compared.
- Results on the testing set
  - mAP: exceeds the previous SotA by 8.5% (upper part of the table).
    - Even without scale search the method outperforms prior work; on the full MPII data the margin is 13%.
  - The margin over prior work suggests that PAFs are effective at capturing the association between part locations.
  - Inference time: roughly 6 orders of magnitude shorter (details in Sec. 3.3).
20. Results > Results on the MPII Multi-Person Dataset
- Evaluation metric
  - mean Average Precision (mAP)
    - Precision: the fraction of items the system predicts as positive that are actually positive (correctness of detections).
      - Here, the fraction of detected body parts that are correct.
    - Recall: the fraction of all positives in the dataset that the system detects (coverage).
      - Here, the fraction of body parts annotated in the dataset that are detected.
    - Average Precision (AP): summarizes precision across recall levels by averaging the precision values obtained at each recall.
      - Often computed as AP = (1/P) Σ_k Precision(k) · I(k) over the ranked predictions, where P is the number of positives and I(k) = 1 for a true positive, 0 for a false positive.
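The AP computation described above can be sketched as follows. This is a minimal illustration assuming every ground-truth positive appears in the ranked prediction list (otherwise the denominator should be the total number of GT positives); `average_precision` is a hypothetical name.

```python
def average_precision(labels):
    """AP from predictions ranked by confidence, highest first.

    labels: list of 1/0 flags (the I in the slide), 1 if the k-th
    ranked prediction is a true positive. Precision(k) is evaluated
    at each true positive and averaged.
    Assumes all GT positives appear in the list; a full implementation
    divides by the total number of GT positives instead.
    """
    tp, precisions = 0, []
    for k, is_pos in enumerate(labels, start=1):
        tp += is_pos
        if is_pos:
            precisions.append(tp / k)   # Precision(k) at each hit
    return sum(precisions) / len(precisions) if precisions else 0.0
```

Ranking the two true positives first gives AP = 1.0; interleaving them with false positives lowers it, matching the intuition that AP rewards well-ranked detections.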
21. Results > Results on the MPII Multi-Person Dataset
- Evaluation metric
  - mean Average Precision (mAP), continued
    - mAP is the mean over all body parts of the per-part Average Precision.
    - Pose estimation is run on images that contain multiple people.
    - Following the PCKh measure, each estimated keypoint is matched to the ground truth (GT).
    - Predicted keypoints not matched to any GT are treated as false positives.
    - Average Precision (AP) is computed for each body part.
    - The AP is averaged over all body parts to obtain the mAP.
    - PCKh threshold
      - PCP: a limb is considered correctly detected if both of its estimated endpoints fall within half of the limb length of the true positions.
      - PCK: the matching threshold is defined relative to the person's bounding box.
      - PCKh: the matching threshold is defined as 50% of the head segment length.
22. Results > Results on the MPII Multi-Person Dataset
- Comparison of results across different graph structures:
  - (6b) a fully connected graph over all parts, (6c) the minimal tree edges solved with full optimization, and (6d) the minimal tree edges solved with the greedy algorithm.
  - With the minimal set of edges, accuracy is essentially unchanged from the full graph.
  - (6d) is by far the fastest: the number of edges to evaluate is much smaller (13 edges vs. 91 edges), which is thought to work because the PAF network already encodes global context.