�ݺ�ߣ

ACM Multimedia 2008

Lei Wu, Xian-Sheng Hua, Nenghai Yu, Wei-Ying Ma, Shipeng Li

Microsoft Research Asia
University of Science and Technology of China

October 28, 2008

Multimedia
��
Information
Indexing
Ranking
Clustering
Annotation
Recommendation

Retrieval

2

Indexing

Ranking
Annotation Image
Similarity/
Distance
Multimedia
Information
Retrieval
Concept
Similarity/
Recommendation Distance Clustering

��

3

Image Similarity/Distance

Image
Similarity/
Distance

Concept
Similarity/
Distance

4


Numerous efforts have been made.
Concept Similarity/Distance

Concept
Similarity/
Distance

5


Numerous efforts have been made.

Concept Similarity/Distance
Olympic

Cat Sports
Paw Tiger
More and more used, but not well studied.

6

WordNet Distance

Google Distance

Tag Concurrence Distance

7

WordNet
150,000 words

WordNet Distance
Quite a few methods to get it in WordNet
Basic idea is to measure the length of the
path between two words

Pros and Cons
Pros: Built by human experts, so close to
human perception
Cons: Coverage is limited and difficult to
extend
8

Normalized Google Distance (NGD)
Reflects the concurrency of two words in Web documents
Defined as

Pros and Cons
Pros: Easy to get and huge coverage
Cons: Only reflects concurrency in textual documents. Not really
concept distance (semantic relationship)

9

Concept Pairs Google Distance
Airplane �C Dog 0.2562

Football �C Soccer 0.1905

Horse �C Donkey 0.2147

Airplane �C Airport 0.3094

Car �C Wheel 0.3146

10

Image Tag Concurrence Distance (Qi, Hua, et al. ACMMM07)
Reflects the frequency of two tags
occur in the same images
Based on the same idea of NGD
Mostly is sparse (> 95% are zero in
the similarity matrix)

Pros and Cons
Pros: Images are taken into account
Cons: a) Tags are sparse so visual
concurrency is not well reflected
b) Training data is difficult to get similarity matrix: 500 tags
similarity matrix: 50 tags

11

Concept Pairs Google Distance
Tag Concurrence Distance
Airplane �C Dog 0.2562 0.8532

Football �C Soccer 0.1905 0.1739

Horse �C Donkey 0.2147 0.4513

Airplane �C Airport 0.3094 0.1833

Car �C Wheel 0.3146 0.9617

12

table �� ping- horse �� donkey car �� wheel airplane �� airport
tennis pong

Synonymy Visually Similar Meronymy Concurrency
different words but similar things or part and exist at the same
the same meaning things of same type the whole scene/place

13

Image tag concurrence
distance implicitly uses
image information, but
Concept Distance tags are too sparse
Mine from image tags

Google distance��s coverage
is very high, but it is for
text domain
Mine from text documents

WordNet distance is good,
but coverage is too low
Mine from ontology

14

Can we mine concept distance
from image content?

15

Semantic concept distance is based on human��s cognition
To of humanconcept distanceinformation
80%
mine cognition comes from visual from a
large tagged image collection
There are around 2.8 billion photos on Flickr (by Sep 08)
based on image content
In average each Flickr image has around 8 tags

bear, fur, grass, tree polar bear, water, sea polar bear, fighting, usa

16

Concept A: Airplane Concept B: Airport

Concept Model A Concept Model B

Flickr Distance (A, B)

17

FlickrDistance
0.5151

0.0315

0.4231

0.0576

0.0708

Flickr Distance is able to cover the
four different semantic relationships
Synonymy, Visually Similar, Meronymy, and Concurrency

18

R1: A Good Image Collection
Large
High coverage, especially on real-life world
With tags

Real-life photos reflect human��s perception
Based on collective intelligence from grassroots

19

R2: A Good Concept Representation or Model
Based on image content
Can cover wider concept relationships
Can handle large-concept set
Concept Models
Discriminative SVM, Boosting, ��

Generative
Global Feature

Local Feature

w/o Spatial Relation Bag-of-Words (pLSA, LDA), ��

w/ Spatial Relation 2D HMM, MRF, �� 20

VLM �C Visual Language Model
Spatial-relation sensitive
Efficient
Concept Models object variations
Can handle
Discriminative SVM, Boosting, ��

Generative
Global Feature

Local Feature

w/o Spatial Relation Bag-of-Words, ��
w/ Spatial Relation 2D HMM, MRF, �� 21

I am talking about statistical language model.

Unigram Model P?wx w1w2 ? wn ? ? P?wx ?

Bigram Model P?wx w1w2 ? wn ? ? p?wx wx?1 ?

Trigram Model P?wx w1w2 ? wn ? ? P?wx wx?1 wx ?2 ?

22

Visual Word Generation

1 0 0 1 0 0 1 0
Patch ? Gradient Texture Histogram Hashing ? Visual Word

Image ? Patch

Unigram Model P?wxy w11w12 ? wmn ? ? P?wxy ?

Bigram Model P?wxy w11w12 ? wmn ? ? P wxy wx ?1, y ? ?
Trigram Model P?wxy w11w12 ? wmn ? ? p?w xy wx?1, y wx, y ?1 ?
Trigram VLM is estimated by directly counting from sufficient samples of each category.
To avoid the bias in the sampling, back-off smoothing method is adopted.
23

Comparison on Image Categorization
Caltech 8 categories / 5097 images

Accuracy (%)
100 88 90 90

80 64
59
60
40
20
0
pLSA (BOW) LDA (BOW) 2D MHMM SVM VLM

Training Time (sec/image)
3.00
2.44
2.50
2.00
1.50 1.11
1.00 0.84
0.44
0.50 0.14
0.00
pLSA (BOW) LDA (BOW) 2D MHMM SVM VLM

24

Why Latent-Topic

Latent-Topic VLM
Visual variations of concept are taken as latent topics

25

Latent-Topic VLM Training
Solved by EM algorithm,
The objective function is to maximize the joint distribution of concept
and its visual word arrangement Aw

maximize p? Aw , C ?
��? ? ?
P wxy wx ?1, y wx , y ?1 , d C
j ?
d C ?C
j x, y

Estimate the posteriors of the hidden topics Maximize the likelihood of visual arrangement

26

Comparison on Image Categorization
Caltech 8 categories / 5097 images

Accuracy (%)
90 90 94
100 88
80 64
59
60
40
20
0
pLSA (BOW) LDA (BOW) 2D MHMM SVM VLM LT-VLM

Training Time (sec/image)
3.00
2.44
2.50
2.00
1.50 1.11
0.84
1.00
0.44
0.50 0.14 0.24
0.00
pLSA (BOW) LDA (BOW) 2D MHMM SVM VLM LT-VLM

27

Kullback �C Leibler (KL) divergence
Good, but not symmetric

topic distance

Jensen �CShannon (JS) divergence
Better, as it is symmetric
And, square root of JS divergence is a metric, so is Flickr Distance

topic distance

concept distance 28

Concept A: Airplane Concept B: Airport
Tag search in
Flickr

Concept Model A LT-VLM Concept Model B

Jensen-Shannon
Divergence
Flickr Distance (A, B)

29

Evaluation
Objective evaluation
Subjective evaluation

Applications
Concept clustering
Image annotation
Tag recommendation

30

Images
6,400,000 from Flickr

Concepts
130,000,000 different tags
10,000,000 filtered tags
1,000 randomly-selected tags

Comparison
Normalized Google Distance (NGD)
Tag Concurrence Distance (TCD)
Flickr Distance (FD)

31

Ground-Truth
12 persons are asked to score semantic correlation of each concept pair
Average scores are taken as ground-truth

Evaluate Accuracy of ��Relative Distance Pairs��
Step 1: Find all distance pairs D(a,b) and D(c,d)
Step 2: Check whether the order of D(a,b) and D(c,d) is consistent with ground-truth

Correct Rate
0.56
0.55
0.54
0.53
0.52
0.51
0.5
0.49
0.48
0.47

NGD TCD FD

32

Ground-Truth
WordNet Distance
Only 497 concepts (overlap of WordNet and the 1000 concepts)

Evaluate Accuracy of ��Relative Distance Pairs��
Step 1: Find all distance pairs D(a,b) and D(c,d)
Step 2: Check whether the order of D(a,b) and D(c,d) is consistent with ground-truth

Correct Rate
0.54
0.53
0.52
0.51
0.5
0.49
0.48
0.47
0.46
0.45

NGD TCD FD

33

Concept Clustering
23 concepts;
3 groups �C (1) outer space, (2) animal and (3) sports

Normalized Google Distance Tag Concurrence Distance Flickr Distance
Group1 Group2 Group3 Group 1 Group2 Group3 Group1 Group2 Group3
bears bowling baseball moon baseball basketball moon bears baseball
horses dolphin basketball space donkey bears Saturn dolphin basketball
bowling
moon donkey football Venus softball space donkey football
dolphin
space Saturn golf whale wolf football Venus golf snake
sharks soccer golf horses soccer
snake tennis horses sharks bowling
softball volleyball Saturn spiders softball
spiders sharks tennis volleyball
soccer
turtle whale
spiders
Venus tennis wolf
whale turtle
wolf volleyball

34

Based on an approach using concept relation
Dual Cross-Media Relevance Model (DCMRM, J. Liu et al. ACMMM 2007)
On 79 concepts / 79,000 images
NGD-DCMRM TCD-DCMRM FD-DCMRM
1200
960
1000
correct keywords
Total number of

800

600
423
400 354
301 310
212 186 212 193
200
55 53 57
0
1 2 3 4
NGD-DCMRM 55 212 212 301
TCD-DCMRM 53 186 193 310
FD-DCMRM 57 354 423 960

The number of correctly annotated keywords at the first N words
35

To Improve Tagging Quality
Eliminating tag incompletion, noises, and ambiguity
500 images / 10 recommended tags per image

0.78
0.758
0.76
0.74
0.72
0.7
0.68 0.665
0.66 0.652
0.64
0.62
0.6
0.58
NGD Tag Concurrent Distance Flickr Distance

Precision @ 10
36

Why VLM divergence can estimate concept distance?

Why FD works well even tags are not complete?

room patterns
Computer VLM: computer patterns of trigrams
distribution other patterns

room patterns TV patterns other patterns
TV

room patterns screen patterns other patterns
Office

37

If we find similar patterns in the images
associated with different concepts,
the corresponding concept
relationships can be discovered.

Computer Office

38

Flickr Distance
A novel approach to discover semantic relationships
from image content
based on real-life images from the Web
based on collective intelligence from grassroots

A distance more consistent with human��s perception

A measurement more effective in many applications

39

Flickr Distance as a Service.

40

�ݺ�ߣ

Flickr Distance

More Related Content

Flickr Distance