Leveraging Context to
Support Automated Food
Recognition in Restaurants
Vinay Bettadapura, Edison Thomaz, Aman Parnami, Gregory D.
Abowd, Irfan Essa
Presented by: Pedro Herruzo Sanchez
Contents
• Motivation
• Contributions of the paper
• System Overview
• Features & Classifier
• Evaluation on the PFID dataset
• Evaluation for In-the-wild food images
• Recognition without Location Prior
• Discussion
Motivation
• In 1970, 25.9% of food spending was on food away from home
• By 2012, it was 43.1%
• 80% of Americans report eating fast food monthly and 40% report eating it weekly
• Obesity, nutrition, and chronic diseases are now a major health concern
• Logging of eating habits, using diaries and smartphones, is increasingly common as a way to prevent disease
• Manual tracking (the most common approach) is time-consuming, error-prone, and susceptible to selective under-reporting
Contributions
• They develop an automated workflow where online resources are queried with contextual data (location) to find images and additional information about the restaurant where the food picture was taken, with the intent of building classifiers for food recognition
• Classification by an SMO-MKL multi-class SVM with features extracted from test photographs
• In-the-wild evaluation on food images taken in 10 restaurants across 5 different cuisines: American, Indian, Italian, Mexican, and Thai
• Comparative evaluation focused on the location information of the images
System Overview
• Acquisition of food images using FoodGawker, Instagram, Pinterest, and Flickr
• Determine the restaurant where the picture was taken, using longitude and latitude coordinates and APIs like Yelp or Google Places (a lookup sketch follows below)
• Once they determine a particular restaurant R:
  • Search for R's menu (allmenus.com, openmenu.com)
  • For each item in the menu of R, download the top 50 images from Google Images -> WEAKLY-LABELED TRAINING DATA
• Test data is formed by segmented images from a certain restaurant R
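To make the geo-lookup step concrete, here is a minimal sketch of resolving a photo's GPS coordinates to candidate restaurants. It assumes a Google Places API key and uses the public Places "Nearby Search" endpoint; the function name, radius, and key handling are our own illustration, not the paper's code.

```python
import requests

# Google Places "Nearby Search" endpoint (public legacy Places API).
PLACES_URL = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"

def nearby_restaurants(lat, lon, api_key, radius_m=100):
    """Return names of candidate restaurants around (lat, lon)."""
    params = {
        "location": f"{lat},{lon}",  # latitude,longitude of the photo
        "radius": radius_m,          # search radius in meters (our assumption)
        "type": "restaurant",
        "key": api_key,
    }
    resp = requests.get(PLACES_URL, params=params, timeout=10)
    resp.raise_for_status()
    return [place["name"] for place in resp.json().get("results", [])]
```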
Figure 1. System overview. Pipeline from taking a picture to classifying it.
Features
• They focus on illumination changes: images taken in restaurants are typically indoor and under varying lighting conditions -> color descriptors
• Harris-Laplace point detector as the feature extractor
• For feature descriptors they use:
  • Color Moment Invariants: moment-based color statistics computed over an image region
  • Hue Histogram weighted by saturation (invariant to changes and shifts in light intensity and color)
  • C-SIFT (invariant to changes in light intensity)
  • OpponentSIFT: channels in the opponent color space described using SIFT descriptors (invariant to changes and shifts in light intensity)
  • RGB-SIFT: computed for each channel independently (invariant to changes and shifts in light intensity and color)
  • SIFT (invariant to changes and shifts in light intensity)
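To make one of these descriptors concrete, here is a minimal sketch of a saturation-weighted hue histogram using OpenCV and NumPy. The bin count and normalization are our choices, and the paper computes descriptors over local interest-point regions rather than whole images.

```python
import cv2
import numpy as np

def hue_histogram(image_bgr, bins=36):
    """Saturation-weighted hue histogram for one image (or region).

    Weighting each hue sample by its saturation suppresses near-gray pixels,
    whose hue is unstable; normalizing the histogram makes it robust to
    changes and shifts in light intensity.
    """
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    hue = hsv[..., 0].ravel().astype(np.float64)          # OpenCV hue range: 0..179
    sat = hsv[..., 1].ravel().astype(np.float64) / 255.0  # weights in [0, 1]
    hist, _ = np.histogram(hue, bins=bins, range=(0, 180), weights=sat)
    total = hist.sum()
    return hist / total if total > 0 else hist
```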
Classification using SMO-MKL
• For a given restaurant R, 100,000 interest points are detected
• For each of the 6 descriptors:
  • BoW histogram using k-means with k = 1000
  • Compute extended Gaussian kernels
• A linear combination of these 6 kernels is learned using the Sequential Minimal Optimization (SMO) algorithm, with a p-norm, p > 1
• An SVM is used to classify the images (a code sketch follows the diagram below)
[Diagram: for each restaurant, 100,000 interest points -> 6 descriptor types -> BoW histograms over 1,000 clusters each -> extended Gaussian kernels K1…K6 -> MKL combination]
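The pipeline above can be sketched in a few lines with scikit-learn, where `chi2_kernel` computes the extended Gaussian (χ²) kernel K(x, y) = exp(−γ · χ²(x, y)). The variable names and the use of MiniBatchKMeans are our assumptions; the paper's final SMO-MKL step, which learns the p-norm-constrained kernel combination, is only indicated in a comment.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import chi2_kernel

def build_vocabulary(stacked_descriptors, k=1000, seed=0):
    """Quantize local descriptors (n_points x dim) into a k-word vocabulary."""
    return MiniBatchKMeans(n_clusters=k, random_state=seed).fit(stacked_descriptors)

def bow_histogram(image_descriptors, vocab):
    """L1-normalized bag-of-words histogram for one image."""
    words = vocab.predict(image_descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(np.float64)
    return hist / max(hist.sum(), 1.0)

# For each of the 6 descriptor types: stack the per-image BoW histograms into
# a matrix H_i (n_images x 1000) and build one extended Gaussian (chi-squared)
# kernel per type, e.g. K_i = chi2_kernel(H_i, gamma=0.5). SMO-MKL then learns
# weights w_i >= 0 (p-norm constrained, p > 1) for K = sum_i w_i * K_i.
```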
Evaluation on the PFID dataset
• PFID dataset:
  • 61 categories of fast-food images acquired under lab conditions
  • Each category contains 3 different instances of a food, with 6 images from 6 different viewpoints per instance
• 3-fold cross-validation: 12 images (two instances) for training and the remaining 6 (the third instance) for testing
• In the results, MKL gives the best performance and improves on the state of the art by more than 20%
• Their SIFT approach achieves 34.9% accuracy, whereas the SIFT baseline used in the PFID paper achieves 9.2%. Why?
  • The PFID baseline uses LIBSVM for classification with its default parameters
  • The current approach uses a χ² kernel (with scaled data) and tunes the SVM parameters through a grid search over the space of C and γ (a tuning sketch follows below)
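A minimal sketch of that tuning step with scikit-learn (our own illustration; the grids for C and γ are placeholders, not the values used in the paper):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics.pairwise import chi2_kernel

def tune_chi2_svm(X_hist, y, Cs=(0.1, 1, 10, 100), gammas=(0.1, 0.5, 1.0, 2.0)):
    """Grid search over C and gamma for a chi-squared-kernel SVM.

    X_hist: (n_samples x n_bins) non-negative BoW histograms; y: dish labels.
    Returns the best (C, gamma, cross-validated accuracy).
    """
    best = (None, None, -np.inf)
    for gamma in gammas:
        K = chi2_kernel(X_hist, gamma=gamma)  # precompute kernel once per gamma
        for C in Cs:
            svc = SVC(kernel="precomputed", C=C)
            acc = cross_val_score(svc, K, y, cv=3).mean()
            if acc > best[2]:
                best = (C, gamma, acc)
    return best
```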
Figure 2. The first two results (green) are from the PFID publication. The next two (red) are obtained using GIR and OM. The remaining results (blue) are obtained using the feature descriptors and MKL. (CMI: Color Moment Invariant, C-S: C-SIFT, HH: Hue-Histogram, O-S: OpponentSIFT, R-S: RGB-SIFT, S: SIFT, MKL: Multiple Kernel Learning)
Evaluation for In-the-wild food images
• 10 restaurants across 5 different cuisines: American, Indian, Italian, Mexican, and Thai
• 3 different individuals collected images on different days from the 10 restaurants (2 per cuisine)
• 600 images taken in 2 phases:
  • 300 with a smartphone (5 cuisines × 6 dishes/cuisine × 10 images/dish)
  • 300 using Google Glass
• In the results, note that they achieve limited accuracy for the Mexican and Thai cuisines: these cuisines have a low degree of visual variability between food types belonging to the same cuisine
Figure 3. Classification results. The columns are: CMI: Color Moment Invariant, C-S: C-SIFT, HH:
Hue-Histogram, O-S: OpponentSIFT, R-S: RGB-SIFT, S: SIFT and MKL: Multiple Kernel Learning
Recognition without Location Prior
• Disregard the location information and train the SMO-MKL classifier on all of the training data (3,750 images)
• Accuracy across the 600 test images is 15.67%, whereas the previous location-aware models averaged 63.33%
• Thus, the average performance increased by 47.66 percentage points (63.33% − 15.67%) when the location prior was included
• They claim that it is better to build several smaller restaurant/cuisine-specific classifiers rather than one all-category food classifier
Discussion
• They built the training dataset based on the most popular foods for a particular cuisine, matching 15 dishes from the menu of the restaurant where the test data was taken, i.e., they built the training data with prior knowledge of the test data
• In our research:
  • If we want to use this approach to get a good food classifier for the city of Barcelona, we should somehow split Barcelona into small sections and learn a small classifier for each
  • When a user uses the classifier, we first locate him/her via geo-tags and then use the corresponding classifier (a dispatch sketch follows below)
• Cons:
  • GPS must be enabled (will its accuracy be enough for pictures?)
  • Many classifiers; we must find the best section for each classifier
• Alternative: one all-categories classifier weighted by GPS data
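A rough sketch of that dispatch idea (the names, the grid-cell partitioning, and the cell size are all hypothetical illustrations of ours, not an implementation from the paper):

```python
import math

CELL_DEG = 0.005  # grid cell size in degrees (~500 m of latitude); an assumption

def cell_id(lat, lon, cell_deg=CELL_DEG):
    """Map GPS coordinates to a discrete grid-cell key."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def classify_photo(features, lat, lon, cell_classifiers, fallback_classifier):
    """Use the section-specific classifier when one exists for the photo's
    cell; otherwise fall back to the all-categories model."""
    model = cell_classifiers.get(cell_id(lat, lon), fallback_classifier)
    return model.predict([features])[0]  # assumes sklearn-style classifiers
```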
Suggestions?
