Leveraging Context to
Support Automated Food
Recognition in Restaurants
Vinay Bettadapura, Edison Thomaz, Aman Parnami, Gregory D.
Abowd, Irfan Essa
Presented by: Pedro Herruzo Sanchez
Contents
• Motivation
• Contributions of the paper
• System Overview
• Features & Classifier
• Evaluation on the PFID dataset
• Evaluation for In-the-wild food images
• Recognition without Location Prior
• Discussion
Motivation
• In 1970, 25.9% of food spending was on food away from home
• By 2012, it was 43.1%
• 80% of Americans report eating fast food monthly and 40% report eating it weekly
• Obesity, nutrition, and chronic diseases are now a major health concern
• Logging of eating habits, using diaries and smartphones, is increasingly common as a way to prevent disease
• Manual tracking (the most common approach) is time-consuming, error-prone, and susceptible to selective under-reporting
Contributions
• They develop an automated workflow where online resources are queried with contextual data (location) to find images and additional information about the restaurant where the food picture was taken, with the intent of building classifiers for food recognition
• Classification by an SMO-MKL multi-class SVM with features extracted from test photographs
• In-the-wild evaluation on food images taken in 10 restaurants across 5 different cuisines: American, Indian, Italian, Mexican, and Thai
• Comparative evaluation focused on the location information of the images
System Overview
• Acquisition of food images using FoodGawker, Instagram, Pinterest, and Flickr
• Determine the restaurant where the picture was taken, using longitude and latitude coordinates and APIs like Yelp or Google Places (a lookup sketch follows below)
• Once they determine a particular restaurant R:
  • Search for R's menu (allmenus.com, openmenu.com)
  • For each item in the menu of R, download the top 50 images from Google Images -> WEAKLY-LABELED TRAINING DATA
• Test data is formed by segmented images from a certain restaurant R
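To make the geo-lookup step concrete, here is a minimal sketch of resolving a photo's GPS coordinates to candidate restaurants. It assumes a Google Places API key and uses the public Places "Nearby Search" endpoint; the function name, radius, and key handling are our own illustration, not the paper's code.

```python
import requests

# Google Places "Nearby Search" endpoint (public legacy Places API).
PLACES_URL = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"

def nearby_restaurants(lat, lon, api_key, radius_m=100):
    """Return names of candidate restaurants around (lat, lon)."""
    params = {
        "location": f"{lat},{lon}",  # latitude,longitude of the photo
        "radius": radius_m,          # search radius in meters (our assumption)
        "type": "restaurant",
        "key": api_key,
    }
    resp = requests.get(PLACES_URL, params=params, timeout=10)
    resp.raise_for_status()
    return [place["name"] for place in resp.json().get("results", [])]
```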
Figure 1. System overview. Pipeline from taking a picture to classifying it.
Features
• They focus on illumination changes: images taken in restaurants are typically indoor and under varying lighting conditions -> color descriptors
• Harris-Laplace point detector as the feature extractor
• For feature descriptors they use:
  • Color Moment Invariants: moment-based color statistics computed over an image region
  • Hue Histogram weighted by saturation (invariant to changes and shifts in light intensity and color)
  • C-SIFT (invariant to changes in light intensity)
  • OpponentSIFT: channels in the opponent color space described using SIFT descriptors (invariant to changes and shifts in light intensity)
  • RGB-SIFT: computed for each channel independently (invariant to changes and shifts in light intensity and color)
  • SIFT (invariant to changes and shifts in light intensity)
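To make one of these descriptors concrete, here is a minimal sketch of a saturation-weighted hue histogram using OpenCV and NumPy. The bin count and normalization are our choices, and the paper computes descriptors over local interest-point regions rather than whole images.

```python
import cv2
import numpy as np

def hue_histogram(image_bgr, bins=36):
    """Saturation-weighted hue histogram for one image (or region).

    Weighting each hue sample by its saturation suppresses near-gray pixels,
    whose hue is unstable; normalizing the histogram makes it robust to
    changes and shifts in light intensity.
    """
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    hue = hsv[..., 0].ravel().astype(np.float64)          # OpenCV hue range: 0..179
    sat = hsv[..., 1].ravel().astype(np.float64) / 255.0  # weights in [0, 1]
    hist, _ = np.histogram(hue, bins=bins, range=(0, 180), weights=sat)
    total = hist.sum()
    return hist / total if total > 0 else hist
```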
Classification using SMO-MKL
• For a given restaurant R, 100,000 interest points are detected
• For each of the 6 descriptors:
  • BoW histogram using k-means with k = 1000
  • Compute extended Gaussian kernels
• A linear combination of these 6 kernels is learned using the Sequential Minimal Optimization (SMO) algorithm, with a p-norm, p > 1
• An SVM is used to classify the images (a code sketch follows the diagram below)
[Diagram: for each restaurant, 100,000 interest points -> 6 descriptor types -> BoW histograms over 1,000 clusters each -> extended Gaussian kernels K1…K6 -> MKL combination]
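The pipeline above can be sketched in a few lines with scikit-learn, where `chi2_kernel` computes the extended Gaussian (χ²) kernel K(x, y) = exp(−γ · χ²(x, y)). The variable names and the use of MiniBatchKMeans are our assumptions; the paper's final SMO-MKL step, which learns the p-norm-constrained kernel combination, is only indicated in a comment.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import chi2_kernel

def build_vocabulary(stacked_descriptors, k=1000, seed=0):
    """Quantize local descriptors (n_points x dim) into a k-word vocabulary."""
    return MiniBatchKMeans(n_clusters=k, random_state=seed).fit(stacked_descriptors)

def bow_histogram(image_descriptors, vocab):
    """L1-normalized bag-of-words histogram for one image."""
    words = vocab.predict(image_descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(np.float64)
    return hist / max(hist.sum(), 1.0)

# For each of the 6 descriptor types: stack the per-image BoW histograms into
# a matrix H_i (n_images x 1000) and build one extended Gaussian (chi-squared)
# kernel per type, e.g. K_i = chi2_kernel(H_i, gamma=0.5). SMO-MKL then learns
# weights w_i >= 0 (p-norm constrained, p > 1) for K = sum_i w_i * K_i.
```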
Evaluation on the PFID dataset
• PFID dataset:
  • 61 categories of fast-food images acquired under lab conditions
  • Each category contains 3 different instances of a food, with 6 images from 6 different viewpoints per instance
• 3-fold cross-validation: 12 images (two instances) for training and the remaining 6 (the third instance) for testing
• In the results, MKL gives the best performance and improves on the state of the art by more than 20%
• Their SIFT approach achieves 34.9% accuracy, whereas the SIFT baseline used in the PFID paper achieves 9.2%. Why?
  • The PFID baseline uses LIBSVM for classification with its default parameters
  • The current approach uses a χ² kernel (with scaled data) and tunes the SVM parameters through a grid search over the space of C and γ (a tuning sketch follows below)
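A minimal sketch of that tuning step with scikit-learn (our own illustration; the grids for C and γ are placeholders, not the values used in the paper):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics.pairwise import chi2_kernel

def tune_chi2_svm(X_hist, y, Cs=(0.1, 1, 10, 100), gammas=(0.1, 0.5, 1.0, 2.0)):
    """Grid search over C and gamma for a chi-squared-kernel SVM.

    X_hist: (n_samples x n_bins) non-negative BoW histograms; y: dish labels.
    Returns the best (C, gamma, cross-validated accuracy).
    """
    best = (None, None, -np.inf)
    for gamma in gammas:
        K = chi2_kernel(X_hist, gamma=gamma)  # precompute kernel once per gamma
        for C in Cs:
            svc = SVC(kernel="precomputed", C=C)
            acc = cross_val_score(svc, K, y, cv=3).mean()
            if acc > best[2]:
                best = (C, gamma, acc)
    return best
```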
Figure 2. The first two results (green) are from the PFID publication. The next two (red) are obtained using GIR and OM. The remaining results (blue) are obtained using the feature descriptors and MKL. (CMI: Color Moment Invariant, C-S: C-SIFT, HH: Hue-Histogram, O-S: OpponentSIFT, R-S: RGB-SIFT, S: SIFT, MKL: Multiple Kernel Learning)
Evaluation for In-the-wild food images
• 10 restaurants across 5 different cuisines: American, Indian, Italian, Mexican, and Thai
• 3 different individuals collected images on different days from the 10 restaurants (2 per cuisine)
• 600 images taken in 2 phases:
  • 300 with a smartphone (5 cuisines × 6 dishes/cuisine × 10 images/dish)
  • 300 using Google Glass
• In the results, note that they achieve limited accuracy for the Mexican and Thai cuisines: these cuisines have a low degree of visual variability between food types belonging to the same cuisine
Figure 3. Classification results. The columns are: CMI: Color Moment Invariant, C-S: C-SIFT, HH:
Hue-Histogram, O-S: OpponentSIFT, R-S: RGB-SIFT, S: SIFT and MKL: Multiple Kernel Learning
Recognition without Location Prior
• Disregard the location information and train the SMO-MKL classifier on all of the training data (3,750 images)
• Accuracy across the 600 test images is 15.67%, whereas the previous location-aware models averaged 63.33%
• Thus, the average performance increased by 47.66 percentage points (63.33% − 15.67%) when the location prior was included
• They claim that it is better to build several smaller restaurant/cuisine-specific classifiers rather than one all-category food classifier
Discussion
• They built the training dataset based on the most popular foods for a particular cuisine, matching 15 dishes from the menu of the restaurant where the test data was taken, i.e., they built the training data with prior knowledge of the test data
• In our research:
  • If we want to use this approach to get a good food classifier for the city of Barcelona, we should somehow split Barcelona into small sections and learn a small classifier for each
  • When a user uses the classifier, we first locate him/her via geo-tags and then use the corresponding classifier (a dispatch sketch follows below)
• Cons:
  • GPS must be enabled (will its accuracy be enough for pictures?)
  • Many classifiers; we must find the best section for each classifier
• Alternative: one all-categories classifier weighted by GPS data
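A rough sketch of that dispatch idea (the names, the grid-cell partitioning, and the cell size are all hypothetical illustrations of ours, not an implementation from the paper):

```python
import math

CELL_DEG = 0.005  # grid cell size in degrees (~500 m of latitude); an assumption

def cell_id(lat, lon, cell_deg=CELL_DEG):
    """Map GPS coordinates to a discrete grid-cell key."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def classify_photo(features, lat, lon, cell_classifiers, fallback_classifier):
    """Use the section-specific classifier when one exists for the photo's
    cell; otherwise fall back to the all-categories model."""
    model = cell_classifiers.get(cell_id(lat, lon), fallback_classifier)
    return model.predict([features])[0]  # assumes sklearn-style classifiers
```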
Suggestions?
