The document discusses BakeSearch, a recipe search tool that clusters recipes based on their ingredients using natural language processing and machine learning techniques to analyze large datasets of recipes. It outlines challenges in clustering over 40,000 recipes with 4,000 unique ingredients and describes using MapReduce on Amazon EMR to pre-calculate distances between all recipe pairs to cluster them into a graph. The tool is meant to help users find recipes based on enriched or depleted ingredients.
5. Disambiguating searches
Classic Chocolate chip cookies
Pattys best chocolate cookies
Peanut butter cookies
Sugar cookies with frosting
Gooey butter cookies
Banana pumpkin cookies
Black and white cookies
Halloween cookies
Bigrams + trigrams from these titles: candidate labels
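A minimal sketch of how bigrams and trigrams from recipe titles could yield candidate labels. The slide only names the n-grams as the source of the labels; the counting step, the `min_count` threshold, and the function names `ngrams` and `candidate_labels` are assumptions made for illustration, not the author's implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def candidate_labels(titles, min_count=2):
    """Count bigrams and trigrams across titles.

    Keeps phrases that occur in at least `min_count` titles
    (the threshold is an assumption, not taken from the slides).
    """
    counts = Counter()
    for title in titles:
        tokens = title.lower().split()
        # A set ensures a phrase repeated within one title is counted once.
        seen = set(ngrams(tokens, 2)) | set(ngrams(tokens, 3))
        counts.update(seen)
    return {" ".join(g): c for g, c in counts.items() if c >= min_count}

titles = [
    "Classic Chocolate chip cookies",
    "Pattys best chocolate cookies",
    "Peanut butter cookies",
    "Sugar cookies with frosting",
    "Gooey butter cookies",
    "Banana pumpkin cookies",
    "Black and white cookies",
    "Halloween cookies",
]
labels = candidate_labels(titles)
```

On these eight titles the only phrase shared by two titles is "butter cookies", so a real system would likely need a larger result set or a softer threshold before the labels become useful.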
7. Defining distance measure
Recipe 1: Ingr1, Ingr2, Ingr3, Ingr4
Recipe 2: Ingr4, Ingr9, Ingr12
Jaccard = (ingredients in both recipes) / (ingredients in either recipe)
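The ratio on this slide can be worked through in a few lines. This sketch assumes the slide's two columns read Recipe 1 = Ingr1, Ingr2, Ingr3, Ingr4 and Recipe 2 = Ingr4, Ingr9, Ingr12; the function name `jaccard_similarity` is hypothetical. The ratio as written is the Jaccard similarity; the usual Jaccard distance is one minus that value.

```python
def jaccard_similarity(a, b):
    """Ingredients in both recipes divided by ingredients in either recipe."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty recipes: define similarity as 0 to avoid dividing by zero.
        return 0.0
    return len(a & b) / len(union)

# Assumed reading of the slide's example columns.
recipe_1 = {"Ingr1", "Ingr2", "Ingr3", "Ingr4"}
recipe_2 = {"Ingr4", "Ingr9", "Ingr12"}

similarity = jaccard_similarity(recipe_1, recipe_2)  # 1 shared / 6 in total
distance = 1.0 - similarity                          # Jaccard distance
```

Here only Ingr4 is shared and six distinct ingredients appear overall, so the similarity is 1/6 and the distance is 5/6.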
8. Challenges of big data
Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds
9. Challenges of big data
40k baking recipes, 4k ingredients
[Figure: # recipes (y-axis, 0 to 4000) vs. # ingredients in recipe (x-axis, 0 to 40)]
10. Challenges of big data
[Figure: # ingredients (y-axis, 0 to 900) vs. # recipes containing ingredient (x-axis, 1 to 10000)]
11. Challenges of big data
Pre-calculate Jaccard distances between every pair of recipes (40k × 40k = 1.6 billion pairs!)
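The arithmetic behind the pair count, plus a brute-force version of the pre-calculation on toy data. This is only a sketch: the function name `pairwise_jaccard_distances` and the toy recipes are hypothetical. Because the measure is symmetric, the number of distinct unordered pairs is about half the slide's 40k × 40k figure.

```python
from itertools import combinations

n_recipes = 40_000
ordered_pairs = n_recipes * n_recipes               # 1,600,000,000 (the slide's figure)
unordered_pairs = n_recipes * (n_recipes - 1) // 2  # 799,980,000 distinct pairs

def pairwise_jaccard_distances(recipes):
    """Map each unordered recipe pair (a, b), a < b, to its Jaccard distance.

    `recipes` is a dict of recipe name -> set of ingredients.
    """
    distances = {}
    for a, b in combinations(sorted(recipes), 2):
        union = recipes[a] | recipes[b]
        shared = recipes[a] & recipes[b]
        # Two empty recipes are treated as distance 0 (a convention chosen here).
        distances[(a, b)] = 1.0 - len(shared) / len(union) if union else 0.0
    return distances

toy = {
    "a": {"flour", "sugar"},
    "b": {"flour", "butter"},
    "c": {"pumpkin"},
}
toy_distances = pairwise_jaccard_distances(toy)
```

The loop is quadratic in the number of recipes, which is why a single machine struggles at 40k recipes and the work is worth distributing.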
12. Challenges of big data
MapReduce on Amazon EMR
Preload into networkx graph
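One way the pairwise job could be phrased in map and reduce steps, simulated here in a single Python process. The slides do not show the actual EMR job, so everything below is an assumption: the inverted-index approach, the function names, and the `max_distance` cutoff are illustrative only. Recipes are assumed to be sets, with no repeated ingredients.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(recipes):
    """Mapper: emit one (ingredient, recipe_id) record per ingredient."""
    for recipe_id, ingredients in recipes.items():
        for ingredient in ingredients:
            yield ingredient, recipe_id

def shuffle(records):
    """Group values by key, as a MapReduce framework does between phases."""
    grouped = defaultdict(list)
    for key, value in records:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducer: count, for every recipe pair, how many ingredients they share."""
    shared = defaultdict(int)
    for recipe_ids in grouped.values():
        for a, b in combinations(sorted(recipe_ids), 2):
            shared[(a, b)] += 1
    return shared

def jaccard_edges(recipes, max_distance=0.8):
    """Return (a, b, distance) edges for pairs within `max_distance`."""
    shared = reduce_phase(shuffle(map_phase(recipes)))
    edges = []
    for (a, b), n_shared in sorted(shared.items()):
        # |A or B| = |A| + |B| - |A and B|
        n_union = len(recipes[a]) + len(recipes[b]) - n_shared
        distance = 1.0 - n_shared / n_union
        if distance <= max_distance:
            edges.append((a, b, distance))
    return edges

toy = {
    "r1": {"flour", "sugar", "butter"},
    "r2": {"flour", "sugar", "egg"},
    "r3": {"pumpkin", "flour"},
}
edges = jaccard_edges(toy, max_distance=0.6)
```

Pairs that share no ingredient are never emitted, which implicitly treats them as distance 1. The resulting edge list could then be preloaded into a networkx graph with `Graph.add_weighted_edges_from(edges)`.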
16. Tools
Back end: Yummly API, Python, Pycurl, Nltk wordnet, MySQL, Amazon AWS
Analysis: Numpy, Scipy, Nltk, networkx, Python, R, Amazon EMR
Front end: HTML/CSS/JavaScript, Twitter Bootstrap, Flask
17. Diane Wu
PhD Genetics, Stanford University, CA
BSc Computing Science, Simon Fraser, Canada