Deep Image:
Scaling up Image Recognition	

Ren Wu	

Distinguished Scientist, Baidu	

wuren@	

@韧在百度
The Color of the Dress
The Color of the Dress	

Color Constancy	

	

Human vs. Artificial Intelligence
Summary @ GTC14	

Big data + Deep learning + High performance computing =
Intelligence	

	

Big data + Deep learning + Heterogeneous computing =
Success	

	

GTC'14: Deep Learning Meets Heterogeneous Computing
Baidu Stock
Baidu Q2'14
Big Data

- Storage: 2,000 PB
- Processing: 10-100 PB/day
- Webpages: 100b-1,000b
- Index: 100b-1,000b
- Updates: 1b-10b/day
- Logs: 100 TB-1 PB/day
Computer Chess and Moore's Law
Deep Blue

A classic example of application-specific system design: an IBM
supercomputer with 480 custom-made VLSI chess chips, running a massively
parallel search algorithm with a highly optimized implementation.
Heterogeneous Computing

1993 world #1: Thinking Machines CM-5/1024, 131 GFlops
2013: Samsung Note 3 smartphone (Qualcomm Snapdragon 800), 129 GFlops

2000 world #1: ASCI White (IBM RS/6000 SP), 6 MW power, 106 tons, 12.3 TFlops
2013: Two Mac Pro workstations (dual AMD GPUs each), 14 TFlops
Deep Learning Applications

- Speech recognition
- Image recognition
- Optical character recognition (OCR)
- Language translation
- Web search
- Computational ads (CTR)
- ...
ImageNet Large-Scale Visual Recognition Challenge

- ImageNet dataset: more than 15 million images in about 22,000 categories
- ILSVRC (ImageNet Large-Scale Visual Recognition Challenge)
  - Classification task: 1.2 million images in 1,000 categories
  - One of the most challenging computer vision benchmarks
  - Increasing attention from both industry and academia

* Olga Russakovsky et al., ECCV 2014
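All results below are quoted as top-5 error: a prediction counts as correct if the true label appears among the model's five highest-scoring classes. A minimal sketch of the metric, with hypothetical `logits`/`labels` inputs (not the official evaluation code):

```python
import numpy as np

def top5_error(logits: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of examples whose true label is NOT among the five
    highest-scoring classes.

    logits: (N, 1000) class scores; labels: (N,) true class indices.
    """
    # Indices of the 5 largest scores per row (unordered within the 5).
    top5 = np.argpartition(logits, -5, axis=1)[:, -5:]
    hits = (top5 == labels[:, None]).any(axis=1)
    return float(1.0 - hits.mean())

# Toy usage: random scores over 1,000 classes.
rng = np.random.default_rng(0)
print(top5_error(rng.normal(size=(8, 1000)), rng.integers(0, 1000, size=8)))
```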
ImageNet Classification Challenge
ImageNet classification 2012-2014

Team         Year   Place   Error (top-5)   Uses external data
SuperVision  2012   -       16.4%           no
SuperVision  2012   1st     15.3%           ImageNet 22k
Clarifai     2013   -       11.7%           no
Clarifai     2013   1st     11.2%           ImageNet 22k
MSRA         2014   3rd     7.35%           no
VGG          2014   2nd     7.32%           no
GoogLeNet    2014   1st     6.67%           no

Slide credit: Yangqing Jia, Google

Invincible?
Our approach: insights and inspirations

"多算胜，少算不胜" (Sun Tzu 孙子, 544-496 BC, "Laying Plans" 计篇):
more calculations win; fewer calculations lose.

"元元本本，殚见洽闻" (Ban Gu 班固, 32-92 AD, "Rhapsody on the Western
Capital" 西都赋): the more you see, the more you know.

"明足以察秋毫之末" (Mencius 孟子, 372-289 BC, 梁惠王上): sight keen enough
to discern the tip of an autumn hair, i.e. the ability to see very fine
details.
Project Minwa (百度敏娲)

- Minerva + Athena + 女娲 (Nüwa)
- Athena: goddess of wisdom, warfare, divine intelligence, architecture, and crafts
- Minerva: goddess of wisdom, magic, medicine, arts, commerce, and defense
- 女娲 (Nüwa): molded humans from clay, mended the sky with smelted stone;
  also associated with marriage and musical instruments

World's Largest Artificial Neural Networks

- Pushing the state of the art
- ~100x bigger than previous ones
- A new kind of intelligence?
Hardware/Software Co-design

- Stochastic gradient descent (SGD)
  - High compute density
- Scale up, to as many as 100 nodes
  - High bandwidth, low latency
- 36 nodes, 144 GPUs, 6.9 TB host memory, 1.7 TB device memory
  - 0.6 PFLOPS
- Highly optimized software stack
  - RDMA / GPUDirect
  - New data partition and communication strategies (see the sketch below)

GPUs, InfiniBand
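The deck credits much of the scaling to RDMA/GPUDirect plus those new data-partition and communication strategies, detailed in the Deep Image paper. As a minimal sketch of the underlying pattern only, here is synchronous data-parallel SGD that averages gradients with an MPI all-reduce; the model, gradient function, and batch source are hypothetical stand-ins, not Baidu's implementation:

```python
import numpy as np
from mpi4py import MPI  # in a real setup, each MPI rank drives one GPU

comm = MPI.COMM_WORLD
rank, world = comm.Get_rank(), comm.Get_size()

dim, lr = 1000, 0.01
weights = np.zeros(dim)  # every rank holds an identical replica

def local_gradient(w, batch):
    """Hypothetical stand-in for forward/backward on this rank's shard."""
    return w - batch.mean(axis=0)  # placeholder math, not a real loss

for step in range(100):
    # Each rank draws its own mini-batch shard (data parallelism).
    rng = np.random.default_rng(step * world + rank)
    grad = local_gradient(weights, rng.normal(size=(32, dim)))

    # One all-reduce per step averages gradients across all ranks.
    grad_sum = np.empty_like(grad)
    comm.Allreduce(grad, grad_sum, op=MPI.SUM)
    weights -= lr * (grad_sum / world)  # replicas stay in sync
```

Run with, e.g., `mpirun -n 4 python sgd_allreduce.py`; the all-reduce is the step whose cost the InfiniBand fabric and RDMA/GPUDirect are meant to hide.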
Minwa
Speedup (wall time to convergence)

[Chart: validation-set accuracy vs. training time (hours, log scale from
0.25 to 256) for 1, 16, and 32 GPUs. To reach 80% accuracy, 32 GPUs need
8.6 hours vs. 212 hours for 1 GPU: a 24.7x speedup.]
Never have enough training examples!

Key observations
- Invariant to the illuminant of the scene
- Invariant to observers

Augmentation approaches
- Color casting
- Optical distortion
- Rotation, cropping, etc.

Data Augmentation

"见多识广" (the more you see, the more you know)
And the Color Constancy

Key observations
- Invariant to the illuminant of the scene
- Invariant to observers

Augmentation approaches
- Color casting
- Optical distortion
- Rotation, cropping, etc.

The color of the Dress

"Inspired by the color constancy principle. Essentially, this 'forces'
our neural network to develop its own color constancy ability."
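A minimal sketch of what a color-casting augmentation of this kind might look like: shift all pixels of an RGB image by a random per-channel offset, simulating a change of illuminant. The ±20 offset range is an assumption for illustration, not a value stated in the deck:

```python
import numpy as np

def color_cast(image: np.ndarray, max_offset: int = 20, rng=None) -> np.ndarray:
    """Shift an RGB uint8 image (H, W, 3) by a random per-channel offset.

    Casting the whole image toward one color mimics a change of
    illuminant, so the network is pushed to learn its own color
    constancy, as the slide describes. max_offset=20 is assumed.
    """
    if rng is None:
        rng = np.random.default_rng()
    offset = rng.integers(-max_offset, max_offset + 1, size=3)
    shifted = image.astype(np.int16) + offset  # widen to avoid uint8 wraparound
    return np.clip(shifted, 0, 255).astype(np.uint8)
```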
Data Augmentation

Augmentation      Number of possible changes
Color casting     68,920
Vignetting        1,960
Lens distortion   260
Rotation          20
Flipping          2
Cropping          82,944 (crop size 224x224, input image size 512x512)

Possible variations: the Deep Image system learned from ~2 billion
examples, out of 90 billion possible candidates.
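Two of the counts in the table can be sanity-checked. The cropping count matches sliding a 224x224 window one pixel at a time inside a 512x512 image, and the color-casting count is exactly 41^3 - 1, which would correspond to per-channel offsets in [-20, 20] with the all-zero identity cast excluded; the latter reading is a guess, not stated on the slide:

```python
# Cropping: a 224x224 window slid one pixel at a time inside 512x512.
assert (512 - 224) ** 2 == 82944

# Color casting (hypothesis): offsets in [-20, 20] per RGB channel,
# i.e. 41 choices per channel, minus the all-zero identity cast.
assert 41 ** 3 - 1 == 68920
```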
Data augmentation vs. Overfitting
Examples

Bathtub, Isopod
Indian elephant, Ice bear

Some hard cases addressed by adding our data augmentation.
Multi-scale training

- Same crop size, different resolutions
  - Fixed-size 224x224 crops
- Downsized training images
  - Reduce computational cost, but not state-of-the-art
- Different models trained on different image sizes: 256x256, 512x512
- The high-resolution model works
  - 256x256 model: top-5 7.96%
  - 512x512 model: top-5 7.42%
- Multi-scale models are complementary
  - Fused model: 6.97% (a fusion sketch follows below)

"明查秋毫" (discern the finest detail)
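The deck does not spell out how the fused model is built; a common minimal scheme, shown here under that assumption, is to average the class posteriors of the per-scale models. The `preprocess`/`predict` interfaces are hypothetical:

```python
import numpy as np

def fused_predict(image, scale_models):
    """Fuse multi-scale models by averaging their class posteriors.

    scale_models: list of (preprocess, predict) pairs, one pair per
    training resolution (e.g. 256x256 and 512x512). Each predict() is
    assumed to return a (1000,) softmax probability vector.
    """
    probs = [predict(preprocess(image)) for preprocess, predict in scale_models]
    return np.mean(probs, axis=0)

# Top-5 decision from the fused posterior:
# top5 = np.argsort(fused_predict(img, models))[-5:]
```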
Multi-scale training

Tricycle, Washer, Backpack, Little blue heron, Tricycle
Model

- One basic configuration has 16 layers
- The number of weights in our configuration is 212.7M
- About 40% bigger than VGG's

Team        Top-1 val. error   Top-5 val. error
GoogLeNet   -                  7.89%
VGG         25.9%              8.0%
Deep Image  24.88%             7.42%
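For context on the 212.7M figure: in a VGG-style network nearly all weights sit in the convolutional and fully connected layers, and can be counted as below. The layers used here are a toy example, not the actual Deep Image configuration:

```python
def conv_params(k, c_in, c_out):
    # k x k kernels over c_in input channels, c_out filters, plus biases
    return k * k * c_in * c_out + c_out

def fc_params(n_in, n_out):
    return n_in * n_out + n_out

# Toy slice only: two conv layers and one fc layer of a VGG-like net.
total = conv_params(3, 3, 64) + conv_params(3, 64, 64) + fc_params(4096, 4096)
print(f"{total:,} parameters in this slice")  # 16,820,032
```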
Compare to state-of-the-art

Deep Image set a new record of 5.98% top-5 error on the test dataset, a
10.2% relative improvement over the previous best result.

Team           Year   Place   Top-5 test error
SuperVision    2012   1       16.42%
ISI            2012   2       26.17%
VGG            2012   3       26.98%
Clarifai       2013   1       11.74%
NUS            2013   2       12.95%
ZF             2013   3       13.51%
GoogLeNet      2014   1       6.66%
VGG            2014   2       7.32%
MSRA           2014   3       8.06%
Andrew Howard  2014   4       8.11%
DeeperVision   2014   5       9.51%
Deep Image     -      -       5.98%
Latest results

Team        Date         Top-5 test error
GoogLeNet   2014         6.66%
Deep Image  01/12/2015   5.98%
Deep Image  02/05/2015   5.33%
Microsoft   02/05/2015   4.94%
Google      03/02/2015   4.82%
Deep Image  03/17/2015   4.83%
Robustness
Major differentiators

- Custom-built supercomputer dedicated to DL
- Simple, scalable algorithm + fully optimized software stack
- Larger models
- More aggressive data augmentation
- Multi-scale, including high-resolution images

Brute force + insights, and push to the extreme
Thank you!
