The document describes BakeSearch, a recipe search tool that clusters recipes based on ingredient similarity using the Jaccard distance measure. It discusses challenges in clustering a large number of recipes efficiently, including long runtimes for clustering algorithms and calculating distances between 1.6 billion recipe pairs. It proposes solutions like MapReduce on Amazon EMR and preloading data into a graph structure. The document also covers recipe data processing and tools used in the system.
5. Disambiguating searches
Recipe titles:
Classic chocolate chip cookies
Patty's best chocolate cookies
Peanut butter cookies
Sugar cookies with frosting
Gooey butter cookies
Banana pumpkin cookies
Black and white cookies
Halloween cookies
Bigrams + Trigrams -> Candidate labels
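A minimal sketch of how bigrams and trigrams could be pulled from recipe titles to form candidate labels. The function names `ngrams` and `candidate_labels` are hypothetical, and ranking candidates by how often they occur is an assumption; the slide only says that bigrams and trigrams yield the candidate labels.

```python
from collections import Counter

def ngrams(title, n):
    """Return the word n-grams of a title, lowercased."""
    words = title.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def candidate_labels(titles, sizes=(2, 3)):
    """Count every bigram and trigram across all titles.

    Assumption: frequency is used to rank candidates; the slides
    do not say how candidates are scored.
    """
    counts = Counter()
    for title in titles:
        for n in sizes:
            counts.update(ngrams(title, n))
    return counts

titles = [
    "Classic chocolate chip cookies",
    "Peanut butter cookies",
    "Sugar cookies with frosting",
    "Gooey butter cookies",
    "Banana pumpkin cookies",
    "Black and white cookies",
    "Halloween cookies",
]

labels = candidate_labels(titles)
print(labels.most_common(3))
```

With these titles, "butter cookies" is the only n-gram that appears in more than one title, so it would rank first under the frequency assumption.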
6. Defining distance measure
Recipe 1: Ingr1, Ingr2, Ingr3, Ingr4
Recipe 2: Ingr4, Ingr9, Ingr12
Jaccard = (ingredients in both recipes) / (ingredients in either recipe)
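The ratio above can be worked through on the slide's own example. This is a sketch, not the deck's code: the function names are hypothetical, and treating the distance as one minus the ratio is the standard Jaccard distance definition rather than something the slide spells out.

```python
def jaccard_similarity(a, b):
    """Ingredients in both recipes divided by ingredients in either recipe."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty ingredient lists count as having no overlap.
        return 0.0
    return len(a & b) / len(union)

def jaccard_distance(a, b):
    """Standard Jaccard distance: 1 minus the similarity."""
    return 1.0 - jaccard_similarity(a, b)

recipe_1 = {"Ingr1", "Ingr2", "Ingr3", "Ingr4"}
recipe_2 = {"Ingr4", "Ingr9", "Ingr12"}

# One shared ingredient (Ingr4) out of six distinct ingredients overall.
print(jaccard_similarity(recipe_1, recipe_2))
print(jaccard_distance(recipe_1, recipe_2))
```

For the two recipes shown, the similarity is 1/6 and the distance is 5/6.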
9. Challenges of big data
• Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds
10. Challenges of big data
• Pre-calculate Jaccard distances between every pair of recipes (1.6 billion pairs!)
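To see why the pair count grows so quickly, here is a small sketch of the arithmetic. It assumes the 1.6 billion figure counts unordered pairs, n(n-1)/2, and that the figure is rounded; the function names and the toy recipes are made up for illustration.

```python
import math
from itertools import combinations

def pair_count(n):
    """Number of unordered pairs among n recipes: n(n-1)/2."""
    return n * (n - 1) // 2

def recipes_for_pairs(pairs):
    """Largest n whose pair count does not exceed `pairs`."""
    return (1 + math.isqrt(1 + 8 * pairs)) // 2

def all_pair_distances(recipes):
    """Brute-force Jaccard distance for every pair (fine only for small inputs)."""
    return {
        (a, b): 1 - len(recipes[a] & recipes[b]) / len(recipes[a] | recipes[b])
        for a, b in combinations(sorted(recipes), 2)
    }

# Under the unordered-pairs assumption, 1.6 billion pairs corresponds to
# a collection on the order of tens of thousands of recipes.
print(recipes_for_pairs(1_600_000_000))

# Hypothetical toy data, not taken from the slides.
toy = {"a": {"x", "y"}, "b": {"y", "z"}, "c": {"q"}}
print(all_pair_distances(toy))
```

The brute-force loop is quadratic in the number of recipes, which is why the distances are computed ahead of time rather than per search.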
11. Challenges of big data
[Figure: # Recipes (y-axis, 0 to 4000) vs. # Ingredients in recipe (x-axis, 0 to 40)]
12. Challenges of big data
[Figure: # ingredients (y-axis, 0 to 900) vs. # recipes containing ingredient (x-axis, 1 to 10000, log-spaced ticks)]
13. Challenges of big data
• MapReduce on Amazon EMR
• Preload into networkx graph
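The two steps above might fit together roughly as follows. This is a local, single-process sketch of a MapReduce-style computation, not the EMR job from the deck: the phase functions, the inverted-index approach, and the toy recipes are all assumptions. Only `networkx.Graph` and `Graph.add_edge` are real library calls, and networkx is imported inside `build_graph` so the rest runs without it.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(recipes):
    """Map: emit (ingredient, recipe_id) for every ingredient of every recipe."""
    for recipe_id, ingredients in recipes.items():
        for ingredient in ingredients:
            yield ingredient, recipe_id

def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: count shared ingredients for each pair of recipes that co-occur."""
    shared = defaultdict(int)
    for recipe_ids in grouped.values():
        for a, b in combinations(sorted(recipe_ids), 2):
            shared[(a, b)] += 1
    return shared

def jaccard_distances(recipes):
    """Distances for pairs sharing at least one ingredient.

    Pairs with nothing in common are omitted; their distance is 1.
    Assumes each recipe's ingredients are a set (no duplicates).
    """
    shared = reduce_phase(shuffle(map_phase(recipes)))
    return {
        (a, b): 1 - n / (len(recipes[a]) + len(recipes[b]) - n)
        for (a, b), n in shared.items()
    }

def build_graph(distances):
    """Preload precomputed distances into a weighted networkx graph."""
    import networkx as nx  # third-party dependency
    graph = nx.Graph()
    for (a, b), distance in distances.items():
        graph.add_edge(a, b, weight=distance)
    return graph

# Hypothetical toy data, not taken from the slides.
recipes = {
    "sugar": {"flour", "sugar", "butter"},
    "peanut": {"flour", "sugar", "peanut butter"},
    "salsa": {"tomato", "onion"},
}
distances = jaccard_distances(recipes)
print(distances)
```

Grouping by ingredient means only recipes that share something are ever compared. With the graph held in memory, a search can look up precomputed edge weights instead of recomputing distances.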