Discussing the paper "Leveraging context to support automated food recognition in restaurants" for the CVUB reading group.
1. Leveraging Context to
Support Automated Food
Recognition in Restaurants
Vinay Bettadapura, Edison Thomaz, Aman Parnami, Gregory D.
Abowd, Irfan Essa
Presented by: Pedro Herruzo Sanchez
2. Contents
Motivation
Contributions of the paper
System Overview
Features & Classifier
Evaluation on the PFID dataset
Evaluation for In-the-wild food images
Recognition without Location Prior
Discussion
3. Motivation
In 1970, 25.9% of food spending was on food away from home
By 2012, it was 43.1%
80% of Americans report eating fast food monthly and 40% report eating it
weekly
Obesity, nutrition, and chronic diseases are now a major health
concern
Logging of eating habits, using diaries and smartphones, is increasingly
common as a way to prevent disease
Manual tracking (the most common method) is time-consuming, prone
to errors, and susceptible to selective under-reporting
4. Contributions
They develop an automated workflow in which online resources are queried with
contextual data (location) to find images and additional information about the
restaurant where a food picture was taken, with the intent of building classifiers for
food recognition.
Classification with an SMO-MKL multi-class SVM, using features extracted from test
photographs
In-the-wild evaluation on food images taken in 10 restaurants across 5
different cuisines: American, Indian, Italian, Mexican and Thai
Comparative evaluation focused on the location information of the images
5. System Overview
Acquisition of food images using FoodGawker, Instagram, Pinterest and Flickr
Determine the restaurant where the picture was taken, using longitude and latitude
coordinates and APIs like Yelp or Google Places
Once they determine a particular restaurant R (workflow sketched below):
Search for R's menu (allmenus.com, openmenu.com)
For each item in the menu of R, download the top 50 images from Google Images
WEAKLY-LABELED TRAINING DATA
Test data formed by segmented images from a certain restaurant R
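A minimal sketch of this data-collection workflow in Python. The helpers (find_restaurant, fetch_menu, image_search) are hypothetical stand-ins for the Yelp/Google Places, allmenus.com/openmenu.com, and Google Images queries; none of them come from the paper.

```python
# Hypothetical helpers: find_restaurant, fetch_menu, image_search stand in
# for the Yelp/Google Places, menu-site, and Google Images queries.

def build_weakly_labeled_dataset(lat, lon, images_per_dish=50):
    """Return {dish_name: [image, ...]} for the restaurant at (lat, lon)."""
    restaurant = find_restaurant(lat, lon)   # e.g. Yelp or Google Places API
    menu = fetch_menu(restaurant)            # e.g. allmenus.com, openmenu.com
    dataset = {}
    for dish in menu:
        # Top-ranked web images serve as weak (noisy) labels for each dish.
        dataset[dish] = image_search(dish, limit=images_per_dish)
    return dataset
```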
7. Features
They focus on illumination changes: images taken in restaurants are typically indoor
and under varying lighting conditions -> color descriptors
Harris-Laplace point detector as the feature extractor
For the feature descriptors they use:
Color Moment Invariants: invariants computed from color moments over image regions
Hue Histogram weighted by saturation (invariant to changes and shifts in light intensity and
color)
C-SIFT (invariant to changes in light intensity)
OpponentSIFT: channels in the opponent color space described using SIFT descriptors
(invariant to changes and shifts in light intensity)
RGB-SIFT: computed for each channel independently (invariant to changes and shifts in
light intensity and color)
SIFT (invariant to changes and shifts in light intensity)
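As a rough illustration, one of the six descriptor pipelines (plain SIFT at Harris-Laplace points) could be computed with opencv-contrib-python as below. This is my sketch, not the authors' code; the color variants (C-SIFT, OpponentSIFT, RGB-SIFT, the hue histogram and the color moment invariants) would need other tooling, e.g. van de Sande's ColorDescriptor software.

```python
# Sketch: SIFT descriptors at Harris-Laplace interest points.
# Assumes opencv-contrib-python; not the authors' actual implementation.
import cv2

def harris_laplace_sift(image_path):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Harris-Laplace: corner-like points with automatic scale selection.
    detector = cv2.xfeatures2d.HarrisLaplaceFeatureDetector_create()
    keypoints = detector.detect(gray)
    # Describe each interest point with a 128-D SIFT vector.
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.compute(gray, keypoints)
    return descriptors  # shape: (num_keypoints, 128)
```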
8. Classification using SMO-MKL
For a given restaurant R, 100,000 interest points are detected
For each of the 6 descriptors:
BoW histogram using k-means with k=1000
Compute extended Gaussian kernels
A linear combination of these 6 kernels is learned using the Sequential Minimal
Optimization (SMO) algorithm, with a p-norm regularizer, p > 1
SVM to classify the images (sketched below)
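A sketch of the bag-of-words and extended-Gaussian-kernel steps, assuming scikit-learn. The SMO-MKL step itself (learning the p-norm-regularized kernel weights) has no scikit-learn equivalent and is only outlined in a comment.

```python
# Sketch of BoW quantization and the extended Gaussian (chi-square) kernel.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import additive_chi2_kernel

def bow_histograms(descriptor_sets, k=1000):
    """Quantize each image's local descriptors into a k-bin histogram."""
    codebook = MiniBatchKMeans(n_clusters=k).fit(np.vstack(descriptor_sets))
    hists = [np.bincount(codebook.predict(d), minlength=k) / len(d)
             for d in descriptor_sets]
    return np.array(hists)

def extended_gaussian_kernel(H):
    """K(x, y) = exp(-chi2(x, y) / A), with A the mean chi-square distance."""
    neg_dist = additive_chi2_kernel(H)   # returns -chi2 distance, shape (n, n)
    A = -neg_dist.mean()                 # mean distance as normalizer
    return np.exp(neg_dist / A)

# One such kernel per descriptor type; SMO-MKL then jointly learns weights
# d_j >= 0 (p-norm regularized, p > 1) for K = sum_j d_j * K_j and the SVM.
```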
10. Evaluation on the PFID dataset
PFID dataset:
61 categories of fast food images acquired under lab conditions
Each category contains 3 different instances of a food, with 6 images from 6 different points of
view per instance.
3-fold cross validation: 12 images for training and the remaining 6 for testing
In the results, MKL gives the best performance and improves on the state of the art by more
than 20%
Their SIFT approach achieves 34.9% accuracy, whereas the SIFT baseline used in the PFID
publication achieves 9.2% accuracy. Why?
The PFID baseline uses LIBSVM for classification with its default parameters.
The current approach uses a χ² kernel (with scaled data) and tunes the SVM parameters through a
grid search over the space of C and γ (see the sketch below).
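A minimal sketch of that tuning step, assuming scikit-learn (the paper does not specify its tooling beyond LIBSVM for the baseline): a χ²-kernel SVM with a grid search over C and γ. The parameter grids are illustrative, not the paper's actual search space.

```python
# Sketch: chi-square-kernel SVM tuned by grid search over C and gamma.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.model_selection import GridSearchCV

def tuned_chi2_svm(X_train, y_train):
    best_gamma, best_model, best_score = None, None, -np.inf
    for gamma in (0.1, 0.5, 1.0, 2.0):
        # chi2_kernel = exp(-gamma * chi2 distance); inputs must be non-negative
        K = chi2_kernel(X_train, gamma=gamma)
        gs = GridSearchCV(SVC(kernel="precomputed"),
                          {"C": [0.1, 1, 10, 100]}, cv=3)
        gs.fit(K, y_train)
        if gs.best_score_ > best_score:
            best_gamma, best_model = gamma, gs.best_estimator_
            best_score = gs.best_score_
    # Test time: best_model.predict(chi2_kernel(X_test, X_train, gamma=best_gamma))
    return best_gamma, best_model
```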
11. Figure 2. The first two results (green) are from the PFID publication. The next two (red)
are obtained using GIR and OM. The remaining (blue) are obtained using the feature descriptors
and MKL. (CMI: Color Moment Invariant, C-S: C-SIFT, HH: Hue-Histogram, O-S: OpponentSIFT,
R-S: RGB-SIFT, S: SIFT and MKL: Multiple Kernel Learning)
12. Evaluation for In-the-wild food images
10 restaurants across 5 different cuisines: American, Indian, Italian, Mexican and Thai
3 different individuals collected images on different days from the 10 restaurants (2
per cuisine)
600 images taken in 2 phases:
300 with a smartphone (5 cuisines × 6 dishes/cuisine × 10 images/dish)
300 using Google Glass
In the results, note that they achieve limited accuracy for the Mexican and Thai cuisines:
these cuisines show a low degree of visual variability between food types.
13. Figure 3. Classification results. The columns are: CMI: Color Moment Invariant, C-S: C-SIFT, HH:
Hue-Histogram, O-S: OpponentSIFT, R-S: RGB-SIFT, S: SIFT and MKL: Multiple Kernel Learning
14. Recognition without Location Prior
Disregard the location information and train the SMO-MKL classifier on all of
the training data (3,750 images)
Accuracy across the 600 test images is 15.67%, whereas the previous
location-specific models averaged 63.33%
Thus, the average performance increased by 47.66 percentage points when the
location prior was included
They claim that it is better to build several smaller restaurant/cuisine-specific
classifiers rather than one all-category food classifier
15. Discussion
They built the training dataset from the most popular foods for a particular
cuisine, matching 15 dishes from the menu of the restaurant where the test data was
taken, i.e., they built the training data with prior knowledge of the test data
In our research:
If we want to use this approach to get a good food classifier for the city of Barcelona, we
should somehow split Barcelona into small regions and learn a small classifier for
each
When a user uses the classifier, we first locate him/her via geo-tags and then
apply the corresponding regional classifier (sketched below)
Cons:
Requires GPS to be enabled (will it be available for all pictures?)
Many classifiers: we need to find the best cluster/region for each classifier
Alternative: one all-categories classifier weighted by GPS data
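A minimal sketch of that geo-partitioned scheme, assuming each Barcelona region keeps its own small classifier; the region centroids and per-region classifiers are hypothetical.

```python
# Sketch: pick the per-region food classifier nearest to the photo's GPS tag.
# region_centroids and region_classifiers are hypothetical inputs.
import math

def nearest_region(lat, lon, region_centroids):
    """Return the region whose centroid is closest to (lat, lon)."""
    return min(region_centroids,
               key=lambda r: math.hypot(lat - region_centroids[r][0],
                                        lon - region_centroids[r][1]))

def classify_food(image, lat, lon, region_classifiers, region_centroids):
    # The geo-tag acts as the location prior: it selects the small classifier.
    region = nearest_region(lat, lon, region_centroids)
    return region_classifiers[region].predict(image)
```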