ºÝºÝߣ

ºÝºÝߣShare a Scribd company logo
ACM Multimedia 2008




Lei Wu, Xian-Sheng Hua, Nenghai Yu, Wei-Ying Ma, Shipeng Li

                     Microsoft Research Asia
          University of Science and Technology of China

                        October 28, 2008
Multimedia
     ¡­¡­
Information
   Indexing
   Ranking
  Clustering
  Annotation
 Recommendation

  Retrieval




                  2
Indexing




                                 Ranking
Annotation           Image
                   Similarity/
                    Distance
                  Multimedia
                  Information
                    Retrieval
                    Concept
                  Similarity/
 Recommendation    Distance      Clustering




                     ¡­¡­

                                              3
Image Similarity/Distance


         Image
       Similarity/
        Distance


        Concept
       Similarity/
        Distance




                            4
Image Similarity/Distance




Numerous efforts have been made.
 Concept Similarity/Distance


           Concept
          Similarity/
           Distance




                                   5
Image Similarity/Distance




   Numerous efforts have been made.




    Concept Similarity/Distance
              Olympic

   Cat                          Sports
            Paw         Tiger
More and more used, but not well studied.

                                            6
WordNet Distance

Google Distance

Tag Concurrence Distance




                           7
WordNet
150,000 words

WordNet Distance
Quite a few methods to get it in WordNet
Basic idea is to measure the length of the
path between two words

Pros and Cons
Pros: Built by human experts, so close to
      human perception
Cons: Coverage is limited and difficult to
      extend
                                             8
Normalized Google Distance (NGD)
Reflects the concurrency of two words in Web documents
Defined as




Pros and Cons
Pros: Easy to get and huge coverage
Cons: Only reflects concurrency in textual documents. Not really
      concept distance (semantic relationship)




                                                                   9
Concept Pairs Google Distance
 Airplane ¨C Dog      0.2562

Football ¨C Soccer    0.1905

 Horse ¨C Donkey      0.2147

Airplane ¨C Airport   0.3094

  Car ¨C Wheel        0.3146




                                10
Image Tag Concurrence Distance (Qi, Hua, et al. ACMMM07)
Reflects the frequency of two tags
occur in the same images
Based on the same idea of NGD
Mostly is sparse (> 95% are zero in
the similarity matrix)

Pros and Cons
Pros: Images are taken into account
Cons: a) Tags are sparse so visual
         concurrency is not well reflected
      b) Training data is difficult to get   similarity matrix: 500 tags
                                               similarity matrix: 50 tags

                                                                            11
Concept Pairs Google Distance
                   Tag Concurrence Distance
 Airplane ¨C Dog      0.2562   0.8532

Football ¨C Soccer    0.1905   0.1739

 Horse ¨C Donkey      0.2147   0.4513

Airplane ¨C Airport   0.3094   0.1833

  Car ¨C Wheel        0.3146   0.9617




                                              12
table ¡ª ping-         horse ¡ª donkey       car ¡ª wheel   airplane ¡ª airport
tennis   pong




 Synonymy             Visually Similar      Meronymy      Concurrency
different words but     similar things or     part and     exist at the same
the same meaning      things of same type    the whole        scene/place




                                                                               13
Image tag concurrence
                   distance implicitly uses
                   image information, but
Concept Distance   tags are too sparse
                            Mine from image tags


                   Google distance¡¯s coverage
                   is very high, but it is for
                   text domain
                             Mine from text documents


                   WordNet distance is good,
                   but coverage is too low
                              Mine from ontology

                                                        14
Can we mine concept distance
    from image content?




                               15
Semantic concept distance is based on human¡¯s cognition
To of humanconcept distanceinformation
80%
    mine cognition comes from visual from a
  large tagged image collection
There are around 2.8 billion photos on Flickr (by Sep 08)
       based on image content
In average each Flickr image has around 8 tags




   bear, fur, grass, tree   polar bear, water, sea   polar bear, fighting, usa



                                                                                 16
Concept A: Airplane                            Concept B: Airport




 Concept Model A                                Concept Model B




                      Flickr Distance (A, B)

                                                                    17
FlickrDistance
                                                               0.5151

                                                               0.0315

                                                               0.4231

                                                               0.0576

                                                               0.0708



 Flickr Distance is able to cover the
four different semantic relationships
    Synonymy, Visually Similar, Meronymy, and Concurrency

                                                                        18
R1: A Good Image Collection
Large
High coverage, especially on real-life world
With tags




                                 Real-life photos reflect human¡¯s perception
                               Based on collective intelligence from grassroots

                                                                                  19
R2: A Good Concept Representation or Model
Based on image content
Can cover wider concept relationships
Can handle large-concept set
Concept Models
     Discriminative          SVM, Boosting, ¡­

       Generative
       Global Feature

        Local Feature

          w/o Spatial Relation   Bag-of-Words (pLSA, LDA), ¡­

           w/ Spatial Relation   2D HMM, MRF, ¡­                20
VLM ¨C Visual Language Model
       Spatial-relation sensitive
       Efficient
Concept Models object variations
      Can handle
     Discriminative          SVM, Boosting, ¡­

       Generative
       Global Feature

        Local Feature

          w/o Spatial Relation   Bag-of-Words, ¡­
           w/ Spatial Relation   2D HMM, MRF, ¡­    21
I am talking about statistical language model.




Unigram Model     P?wx w1w2 ? wn ? ? P?wx ?

Bigram Model      P?wx w1w2 ? wn ? ? p?wx wx?1 ?

Trigram Model     P?wx w1w2 ? wn ? ? P?wx wx?1 wx ?2 ?


                                                         22
Visual Word Generation

                                                                                  1 0 0 1 0 0 1 0
                                      Patch ? Gradient   Texture Histogram        Hashing ? Visual Word




          Image ? Patch



Unigram Model                   P?wxy w11w12 ? wmn ? ? P?wxy ?

 Bigram Model                  P?wxy w11w12 ? wmn ? ? P wxy wx ?1, y ?                      ?
Trigram Model                  P?wxy w11w12 ? wmn               ? ? p?w      xy   wx?1, y wx, y ?1   ?
Trigram VLM is estimated by directly counting from sufficient samples of each category.
       To avoid the bias in the sampling, back-off smoothing method is adopted.
                                                                                                          23
Comparison on Image Categorization
Caltech 8 categories / 5097 images

                          Accuracy (%)
  100                                      88              90        90

   80                   64
            59
   60
   40
   20
    0
        pLSA (BOW)   LDA (BOW)      2D MHMM            SVM          VLM

                                                Training Time (sec/image)
                       3.00
                                                    2.44
                       2.50
                       2.00
                       1.50         1.11
                       1.00                                               0.84
                                                                  0.44
                       0.50                                                      0.14
                       0.00
                                 pLSA (BOW)       LDA (BOW)     2D MHMM   SVM    VLM

                                                                                        24
Why Latent-Topic


Latent-Topic VLM
Visual variations of concept are taken as latent topics




                                                          25
Latent-Topic VLM Training
       Solved by EM algorithm,
       The objective function is to maximize the joint distribution of concept
       and its visual word arrangement Aw


                      maximize p? Aw , C ?
                                 £½?          ?      ?
                                               P wxy wx ?1, y wx , y ?1 , d C
                                                                            j   ?
                                    d C ?C
                                      j      x, y




Estimate the posteriors of the hidden topics        Maximize the likelihood of visual arrangement

                                                                                                26
Comparison on Image Categorization
Caltech 8 categories / 5097 images

                              Accuracy (%)
                                                90      90         94
 100                               88
  80                 64
          59
  60
  40
  20
   0
       pLSA (BOW) LDA (BOW)      2D MHMM       SVM      VLM      LT-VLM



                                               Training Time (sec/image)
                          3.00
                                                 2.44
                          2.50
                          2.00
                          1.50          1.11
                                                                   0.84
                          1.00
                                                         0.44
                          0.50                                            0.14    0.24
                          0.00
                                  pLSA (BOW) LDA (BOW) 2D MHMM     SVM    VLM    LT-VLM

                                                                                          27
Kullback ¨C Leibler (KL) divergence
Good, but not symmetric

                                 topic distance

Jensen ¨CShannon (JS) divergence
Better, as it is symmetric
And, square root of JS divergence is a metric, so is Flickr Distance


                                                     topic distance




                                                  concept distance     28
Concept A: Airplane                            Concept B: Airport
                         Tag search in
                            Flickr




 Concept Model A            LT-VLM              Concept Model B


                                                  Jensen-Shannon
                                                    Divergence
                      Flickr Distance (A, B)

                                                                    29
Evaluation
Objective evaluation
Subjective evaluation

Applications
Concept clustering
Image annotation
Tag recommendation




                        30
Images
6,400,000 from Flickr

Concepts
130,000,000 different tags
10,000,000 filtered tags
1,000 randomly-selected tags

Comparison
Normalized Google Distance (NGD)
Tag Concurrence Distance (TCD)
Flickr Distance (FD)


                                   31
Ground-Truth
12 persons are asked to score semantic correlation of each concept pair
Average scores are taken as ground-truth

Evaluate Accuracy of ¡°Relative Distance Pairs¡±
Step 1: Find all distance pairs D(a,b) and D(c,d)
Step 2: Check whether the order of D(a,b) and D(c,d) is consistent with ground-truth


                                   Correct Rate
            0.56
            0.55
            0.54
            0.53
            0.52
            0.51
             0.5
            0.49
            0.48
            0.47

                       NGD               TCD                FD

                                                                                       32
Ground-Truth
WordNet Distance
Only 497 concepts (overlap of WordNet and the 1000 concepts)

Evaluate Accuracy of ¡°Relative Distance Pairs¡±
Step 1: Find all distance pairs D(a,b) and D(c,d)
Step 2: Check whether the order of D(a,b) and D(c,d) is consistent with ground-truth


                                 Correct Rate
           0.54
           0.53
           0.52
           0.51
            0.5
           0.49
           0.48
           0.47
           0.46
           0.45

                      NGD               TCD                FD

                                                                                       33
Concept Clustering
   23 concepts;
   3 groups ¨C (1) outer space, (2) animal and (3) sports

Normalized Google Distance        Tag Concurrence Distance                  Flickr Distance
Group1   Group2     Group3       Group 1   Group2     Group3       Group1      Group2     Group3
bears    bowling     baseball     moon     baseball   basketball   moon          bears     baseball
horses   dolphin    basketball    space    donkey       bears      Saturn       dolphin   basketball
                                                       bowling
 moon    donkey      football     Venus    softball                space        donkey     football
                                                       dolphin
space     Saturn       golf       whale      wolf      football    Venus          golf      snake
          sharks      soccer                             golf                   horses      soccer
          snake       tennis                           horses                   sharks     bowling
         softball   volleyball                         Saturn                   spiders    softball
         spiders                                       sharks                   tennis    volleyball
                                                        soccer
           turtle                                                                whale
                                                       spiders
          Venus                                         tennis                    wolf
          whale                                         turtle
            wolf                                      volleyball

                                                                                                       34
Based on an approach using concept relation
Dual Cross-Media Relevance Model (DCMRM, J. Liu et al. ACMMM 2007)
On 79 concepts / 79,000 images
                                       NGD-DCMRM   TCD-DCMRM   FD-DCMRM
                     1200
                                                                                         960
                     1000
  correct keywords
   Total number of




                      800

                      600
                                                                         423
                      400                                354
                                                                               301 310
                                               212 186         212 193
                      200
                             55   53   57
                       0
                                  1                 2               3               4
           NGD-DCMRM              55               212             212             301
           TCD-DCMRM              53               186             193             310
           FD-DCMRM               57               354             423             960


                     The number of correctly annotated keywords at the first N words
                                                                                               35
To Improve Tagging Quality
Eliminating tag incompletion, noises, and ambiguity
500 images / 10 recommended tags per image

 0.78
                                                      0.758
 0.76
 0.74
 0.72
  0.7
 0.68                           0.665
 0.66       0.652
 0.64
 0.62
  0.6
 0.58
             NGD        Tag Concurrent Distance   Flickr Distance


                        Precision @ 10
                                                                    36
Why VLM divergence can estimate concept distance?

Why FD works well even tags are not complete?



                           room patterns
       Computer                  VLM: computer patterns of trigrams
                                      distribution            other patterns




                           room patterns    TV patterns         other patterns
         TV




                           room patterns   screen patterns      other patterns
        Office

                                                                               37
If we find similar patterns in the images
   associated with different concepts,
      the corresponding concept
   relationships can be discovered.




       Computer             Office

                                       38
Flickr Distance
  A novel approach to discover semantic relationships
  from image content
   based on real-life images from the Web
   based on collective intelligence from grassroots


  A distance more consistent with human¡¯s perception

  A measurement more effective in many applications



                                                      39
Flickr Distance as a Service.




                                40
41

More Related Content

Flickr Distance

  • 1. ACM Multimedia 2008 Lei Wu, Xian-Sheng Hua, Nenghai Yu, Wei-Ying Ma, Shipeng Li Microsoft Research Asia University of Science and Technology of China October 28, 2008
  • 2. Multimedia ¡­¡­ Information Indexing Ranking Clustering Annotation Recommendation Retrieval 2
  • 3. Indexing Ranking Annotation Image Similarity/ Distance Multimedia Information Retrieval Concept Similarity/ Recommendation Distance Clustering ¡­¡­ 3
  • 4. Image Similarity/Distance Image Similarity/ Distance Concept Similarity/ Distance 4
  • 5. Image Similarity/Distance Numerous efforts have been made. Concept Similarity/Distance Concept Similarity/ Distance 5
  • 6. Image Similarity/Distance Numerous efforts have been made. Concept Similarity/Distance Olympic Cat Sports Paw Tiger More and more used, but not well studied. 6
  • 7. WordNet Distance Google Distance Tag Concurrence Distance 7
  • 8. WordNet 150,000 words WordNet Distance Quite a few methods to get it in WordNet Basic idea is to measure the length of the path between two words Pros and Cons Pros: Built by human experts, so close to human perception Cons: Coverage is limited and difficult to extend 8
  • 9. Normalized Google Distance (NGD) Reflects the concurrency of two words in Web documents Defined as Pros and Cons Pros: Easy to get and huge coverage Cons: Only reflects concurrency in textual documents. Not really concept distance (semantic relationship) 9
  • 10. Concept Pairs Google Distance Airplane ¨C Dog 0.2562 Football ¨C Soccer 0.1905 Horse ¨C Donkey 0.2147 Airplane ¨C Airport 0.3094 Car ¨C Wheel 0.3146 10
  • 11. Image Tag Concurrence Distance (Qi, Hua, et al. ACMMM07) Reflects the frequency of two tags occur in the same images Based on the same idea of NGD Mostly is sparse (> 95% are zero in the similarity matrix) Pros and Cons Pros: Images are taken into account Cons: a) Tags are sparse so visual concurrency is not well reflected b) Training data is difficult to get similarity matrix: 500 tags similarity matrix: 50 tags 11
  • 12. Concept Pairs Google Distance Tag Concurrence Distance Airplane ¨C Dog 0.2562 0.8532 Football ¨C Soccer 0.1905 0.1739 Horse ¨C Donkey 0.2147 0.4513 Airplane ¨C Airport 0.3094 0.1833 Car ¨C Wheel 0.3146 0.9617 12
  • 13. table ¡ª ping- horse ¡ª donkey car ¡ª wheel airplane ¡ª airport tennis pong Synonymy Visually Similar Meronymy Concurrency different words but similar things or part and exist at the same the same meaning things of same type the whole scene/place 13
  • 14. Image tag concurrence distance implicitly uses image information, but Concept Distance tags are too sparse Mine from image tags Google distance¡¯s coverage is very high, but it is for text domain Mine from text documents WordNet distance is good, but coverage is too low Mine from ontology 14
  • 15. Can we mine concept distance from image content? 15
  • 16. Semantic concept distance is based on human¡¯s cognition To of humanconcept distanceinformation 80% mine cognition comes from visual from a large tagged image collection There are around 2.8 billion photos on Flickr (by Sep 08) based on image content In average each Flickr image has around 8 tags bear, fur, grass, tree polar bear, water, sea polar bear, fighting, usa 16
  • 17. Concept A: Airplane Concept B: Airport Concept Model A Concept Model B Flickr Distance (A, B) 17
  • 18. FlickrDistance 0.5151 0.0315 0.4231 0.0576 0.0708 Flickr Distance is able to cover the four different semantic relationships Synonymy, Visually Similar, Meronymy, and Concurrency 18
  • 19. R1: A Good Image Collection Large High coverage, especially on real-life world With tags Real-life photos reflect human¡¯s perception Based on collective intelligence from grassroots 19
  • 20. R2: A Good Concept Representation or Model Based on image content Can cover wider concept relationships Can handle large-concept set Concept Models Discriminative SVM, Boosting, ¡­ Generative Global Feature Local Feature w/o Spatial Relation Bag-of-Words (pLSA, LDA), ¡­ w/ Spatial Relation 2D HMM, MRF, ¡­ 20
  • 21. VLM ¨C Visual Language Model Spatial-relation sensitive Efficient Concept Models object variations Can handle Discriminative SVM, Boosting, ¡­ Generative Global Feature Local Feature w/o Spatial Relation Bag-of-Words, ¡­ w/ Spatial Relation 2D HMM, MRF, ¡­ 21
  • 22. I am talking about statistical language model. Unigram Model P?wx w1w2 ? wn ? ? P?wx ? Bigram Model P?wx w1w2 ? wn ? ? p?wx wx?1 ? Trigram Model P?wx w1w2 ? wn ? ? P?wx wx?1 wx ?2 ? 22
  • 23. Visual Word Generation 1 0 0 1 0 0 1 0 Patch ? Gradient Texture Histogram Hashing ? Visual Word Image ? Patch Unigram Model P?wxy w11w12 ? wmn ? ? P?wxy ? Bigram Model P?wxy w11w12 ? wmn ? ? P wxy wx ?1, y ? ? Trigram Model P?wxy w11w12 ? wmn ? ? p?w xy wx?1, y wx, y ?1 ? Trigram VLM is estimated by directly counting from sufficient samples of each category. To avoid the bias in the sampling, back-off smoothing method is adopted. 23
  • 24. Comparison on Image Categorization Caltech 8 categories / 5097 images Accuracy (%) 100 88 90 90 80 64 59 60 40 20 0 pLSA (BOW) LDA (BOW) 2D MHMM SVM VLM Training Time (sec/image) 3.00 2.44 2.50 2.00 1.50 1.11 1.00 0.84 0.44 0.50 0.14 0.00 pLSA (BOW) LDA (BOW) 2D MHMM SVM VLM 24
  • 25. Why Latent-Topic Latent-Topic VLM Visual variations of concept are taken as latent topics 25
  • 26. Latent-Topic VLM Training Solved by EM algorithm, The objective function is to maximize the joint distribution of concept and its visual word arrangement Aw maximize p? Aw , C ? £½? ? ? P wxy wx ?1, y wx , y ?1 , d C j ? d C ?C j x, y Estimate the posteriors of the hidden topics Maximize the likelihood of visual arrangement 26
  • 27. Comparison on Image Categorization Caltech 8 categories / 5097 images Accuracy (%) 90 90 94 100 88 80 64 59 60 40 20 0 pLSA (BOW) LDA (BOW) 2D MHMM SVM VLM LT-VLM Training Time (sec/image) 3.00 2.44 2.50 2.00 1.50 1.11 0.84 1.00 0.44 0.50 0.14 0.24 0.00 pLSA (BOW) LDA (BOW) 2D MHMM SVM VLM LT-VLM 27
  • 28. Kullback ¨C Leibler (KL) divergence Good, but not symmetric topic distance Jensen ¨CShannon (JS) divergence Better, as it is symmetric And, square root of JS divergence is a metric, so is Flickr Distance topic distance concept distance 28
  • 29. Concept A: Airplane Concept B: Airport Tag search in Flickr Concept Model A LT-VLM Concept Model B Jensen-Shannon Divergence Flickr Distance (A, B) 29
  • 30. Evaluation Objective evaluation Subjective evaluation Applications Concept clustering Image annotation Tag recommendation 30
  • 31. Images 6,400,000 from Flickr Concepts 130,000,000 different tags 10,000,000 filtered tags 1,000 randomly-selected tags Comparison Normalized Google Distance (NGD) Tag Concurrence Distance (TCD) Flickr Distance (FD) 31
  • 32. Ground-Truth 12 persons are asked to score semantic correlation of each concept pair Average scores are taken as ground-truth Evaluate Accuracy of ¡°Relative Distance Pairs¡± Step 1: Find all distance pairs D(a,b) and D(c,d) Step 2: Check whether the order of D(a,b) and D(c,d) is consistent with ground-truth Correct Rate 0.56 0.55 0.54 0.53 0.52 0.51 0.5 0.49 0.48 0.47 NGD TCD FD 32
  • 33. Ground-Truth WordNet Distance Only 497 concepts (overlap of WordNet and the 1000 concepts) Evaluate Accuracy of ¡°Relative Distance Pairs¡± Step 1: Find all distance pairs D(a,b) and D(c,d) Step 2: Check whether the order of D(a,b) and D(c,d) is consistent with ground-truth Correct Rate 0.54 0.53 0.52 0.51 0.5 0.49 0.48 0.47 0.46 0.45 NGD TCD FD 33
  • 34. Concept Clustering 23 concepts; 3 groups ¨C (1) outer space, (2) animal and (3) sports Normalized Google Distance Tag Concurrence Distance Flickr Distance Group1 Group2 Group3 Group 1 Group2 Group3 Group1 Group2 Group3 bears bowling baseball moon baseball basketball moon bears baseball horses dolphin basketball space donkey bears Saturn dolphin basketball bowling moon donkey football Venus softball space donkey football dolphin space Saturn golf whale wolf football Venus golf snake sharks soccer golf horses soccer snake tennis horses sharks bowling softball volleyball Saturn spiders softball spiders sharks tennis volleyball soccer turtle whale spiders Venus tennis wolf whale turtle wolf volleyball 34
  • 35. Based on an approach using concept relation Dual Cross-Media Relevance Model (DCMRM, J. Liu et al. ACMMM 2007) On 79 concepts / 79,000 images NGD-DCMRM TCD-DCMRM FD-DCMRM 1200 960 1000 correct keywords Total number of 800 600 423 400 354 301 310 212 186 212 193 200 55 53 57 0 1 2 3 4 NGD-DCMRM 55 212 212 301 TCD-DCMRM 53 186 193 310 FD-DCMRM 57 354 423 960 The number of correctly annotated keywords at the first N words 35
  • 36. To Improve Tagging Quality Eliminating tag incompletion, noises, and ambiguity 500 images / 10 recommended tags per image 0.78 0.758 0.76 0.74 0.72 0.7 0.68 0.665 0.66 0.652 0.64 0.62 0.6 0.58 NGD Tag Concurrent Distance Flickr Distance Precision @ 10 36
  • 37. Why VLM divergence can estimate concept distance? Why FD works well even tags are not complete? room patterns Computer VLM: computer patterns of trigrams distribution other patterns room patterns TV patterns other patterns TV room patterns screen patterns other patterns Office 37
  • 38. If we find similar patterns in the images associated with different concepts, the corresponding concept relationships can be discovered. Computer Office 38
  • 39. Flickr Distance A novel approach to discover semantic relationships from image content based on real-life images from the Web based on collective intelligence from grassroots A distance more consistent with human¡¯s perception A measurement more effective in many applications 39
  • 40. Flickr Distance as a Service. 40
  • 41. 41