Diane Wu Insight demo
Recipe search
BakeSearch
Make sense of recipes and bake like a pro
Disambiguating searches

 Classic Chocolate chip cookies
 Patty's best chocolate cookies
 Peanut butter cookies
 Sugar cookies with frosting
 Gooey butter cookies
 Banana pumpkin cookies
 Black and white cookies
 Halloween cookies
 Bigrams + Trigrams: candidate labels
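One way to read this slide is that bigrams and trigrams from the recipe titles become the candidate labels. The sketch below is only an illustration under that assumption. The helper names (`ngrams`, `candidate_labels`), the `min_count` cutoff, and ranking by frequency are hypothetical and are not taken from the slides.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_labels(titles, min_count=2):
    """Count bigrams and trigrams across titles; keep those seen in >= min_count titles."""
    counts = Counter()
    for title in titles:
        tokens = re.findall(r"[a-z']+", title.lower())
        for n in (2, 3):
            # set() so that an n-gram is counted once per title
            counts.update(set(ngrams(tokens, n)))
    return [(label, c) for label, c in counts.most_common() if c >= min_count]


titles = [
    "Classic Chocolate chip cookies",
    "Patty's best chocolate cookies",
    "Peanut butter cookies",
    "Sugar cookies with frosting",
    "Gooey butter cookies",
    "Banana pumpkin cookies",
    "Black and white cookies",
    "Halloween cookies",
]
labels = candidate_labels(titles)
```

With the eight titles on the slide, only "butter cookies" appears in more than one title, so a real search result set would need to be larger for frequency ranking to be useful.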
Domain-specific data munging
 Ingredients: nltk dictionary
 Domain knowledge
 Unit parsing
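The slides do not say how unit parsing was done. As a rough sketch of what the step could look like, the code below splits an ingredient line into quantity, unit, and name. The unit table, the regular expression, and the function names are all assumptions made for illustration.

```python
import re
from fractions import Fraction

# Hypothetical unit table; the slides do not list which units were handled.
UNIT_ALIASES = {
    "cup": "cup", "cups": "cup",
    "tablespoon": "tbsp", "tablespoons": "tbsp", "tbsp": "tbsp",
    "teaspoon": "tsp", "teaspoons": "tsp", "tsp": "tsp",
    "ounce": "oz", "ounces": "oz", "oz": "oz",
}

# Leading quantity: "2", "0.5", "3/4", or a mixed number such as "1 1/2".
LINE_RE = re.compile(r"^\s*(?P<qty>\d+(?:\s+\d+/\d+|/\d+|\.\d+)?)\s+(?P<rest>.+)$")


def parse_quantity(text):
    """Convert '1 1/2', '3/4', '2', or '0.5' to a float."""
    return float(sum(Fraction(part) for part in text.split()))


def parse_ingredient_line(line):
    """Return (quantity, unit, name); quantity and unit are None when absent."""
    match = LINE_RE.match(line)
    if match is None:
        return None, None, line.strip().lower()
    quantity = parse_quantity(match.group("qty"))
    words = match.group("rest").lower().split()
    unit = UNIT_ALIASES.get(words[0].rstrip("."))
    name = " ".join(words[1:] if unit else words)
    return quantity, unit, name
```

For example, `parse_ingredient_line("1 1/2 cups all-purpose flour")` yields a quantity of 1.5, the unit `"cup"`, and the name `"all-purpose flour"`. Lines without a leading number are returned unparsed.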
Defining distance measure
 Recipe 1: Ingr1, Ingr2, Ingr3, Ingr4
 Recipe 2: Ingr4, Ingr9, Ingr12
 Jaccard = (Ingredients in both recipes) / (Ingredients in either recipe)
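The Jaccard measure on this slide can be sketched with Python sets, using the two example recipes from the diagram. Taking the distance as 1 minus the similarity, and returning 0.0 when both recipes are empty, are conventions assumed here. The slides do not state them.

```python
def jaccard_similarity(a, b):
    """Ingredients in both recipes divided by ingredients in either recipe."""
    a, b = set(a), set(b)
    either = a | b
    if not either:
        return 0.0  # assumed convention for two empty recipes
    return len(a & b) / len(either)


def jaccard_distance(a, b):
    """Assumed definition of the distance: 1 minus the similarity."""
    return 1.0 - jaccard_similarity(a, b)


recipe_1 = {"Ingr1", "Ingr2", "Ingr3", "Ingr4"}
recipe_2 = {"Ingr4", "Ingr9", "Ingr12"}
similarity = jaccard_similarity(recipe_1, recipe_2)
```

The two recipes share one ingredient (Ingr4) out of six distinct ingredients, so the similarity is 1/6.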
Challenges of big data
 Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds
 40k baking recipes, 4k ingredients
 [Figure: histogram of the number of recipes (y-axis, 0 to 4000) by number of ingredients in the recipe (x-axis, 0 to 40)]
 [Figure: number of ingredients (y-axis, 0 to 900) by number of recipes containing the ingredient (x-axis, 1 to 10000)]
 Pre-calculate Jaccard distances between every pair of recipes (40k times 40k = 1.6 billion pairs!)
 MapReduce on Amazon EMR
 Preload into networkx graph
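The slides do not describe how the MapReduce job was structured. The sketch below shows one possible layout, run in memory in plain Python instead of on Amazon EMR: the map step keys each recipe by ingredient, and the reduce step emits every pair of recipes that share that ingredient. All function names and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(recipes):
    """Emit one (ingredient, recipe_id) pair per distinct ingredient in a recipe."""
    for recipe_id, ingredients in recipes.items():
        for ingredient in set(ingredients):
            yield ingredient, recipe_id


def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each ingredient, emit every pair of recipes that contain it."""
    for recipe_ids in groups.values():
        for a, b in combinations(sorted(recipe_ids), 2):
            yield (a, b), 1


def pairwise_jaccard_distances(recipes):
    """Jaccard distance for every recipe pair sharing at least one ingredient."""
    sizes = {rid: len(set(ings)) for rid, ings in recipes.items()}
    shared = defaultdict(int)
    for pair, count in reduce_phase(shuffle(map_phase(recipes))):
        shared[pair] += count
    distances = {}
    for (a, b), both in shared.items():
        either = sizes[a] + sizes[b] - both  # size of the union
        distances[(a, b)] = 1.0 - both / either
    return distances


sample = {
    "r1": ["flour", "sugar", "butter"],
    "r2": ["flour", "sugar", "egg"],
    "r3": ["banana"],
}
distances = pairwise_jaccard_distances(sample)
```

Pairs that share no ingredient never appear in the output and implicitly have distance 1.0. Each resulting `(a, b)` pair and its distance could then be stored as a weighted edge when preloading the graph.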
Cluster recipes based on ingredients
Find enriched/depleted ingredients
 abs(log2 ratio) > 2
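The slide gives only the threshold, not the ratio being compared. A plausible reading, assumed here, is the frequency of an ingredient within a cluster divided by its frequency across all recipes. The sketch below follows that reading. The pseudocount that avoids taking the log of zero and the function names are also assumptions.

```python
import math
from collections import Counter


def log2_enrichment(cluster_recipes, all_recipes, pseudocount=1.0):
    """log2 of (ingredient frequency in cluster / ingredient frequency overall)."""
    cluster_counts = Counter(i for r in cluster_recipes for i in set(r))
    overall_counts = Counter(i for r in all_recipes for i in set(r))
    n_cluster, n_all = len(cluster_recipes), len(all_recipes)
    ratios = {}
    for ingredient, overall in overall_counts.items():
        # pseudocount (assumed) keeps the ratio finite for absent ingredients
        freq_cluster = (cluster_counts[ingredient] + pseudocount) / (n_cluster + pseudocount)
        freq_all = (overall + pseudocount) / (n_all + pseudocount)
        ratios[ingredient] = math.log2(freq_cluster / freq_all)
    return ratios


def enriched_or_depleted(ratios, threshold=2.0):
    """Keep ingredients whose absolute log2 ratio exceeds the threshold."""
    return {i: r for i, r in ratios.items() if abs(r) > threshold}
```

Under this reading, a positive value above the threshold marks an ingredient that is at least four times more frequent in the cluster than overall, and a negative value below it marks one that is at least four times rarer.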
Tools
 Back end: Yummly API; Python (Pycurl, NLTK wordnet); MySQL
 Analysis: NumPy, SciPy; NLTK, networkx; Python, R; Amazon EMR
 Front end: HTML/CSS/JavaScript; Twitter Bootstrap; Flask; Amazon AWS
Diane Wu
 PhD Genetics, Stanford University, CA
 BSc Computing Science, Simon Fraser, Canada
