Diane Wu: Insight final demo
Recipe search
BakeSearch
Make sense of recipes and bake like a pro
Disambiguating searches

Classic chocolate chip cookies
Patty's best chocolate cookies
Peanut butter cookies
Sugar cookies with frosting
Gooey butter cookies
Banana pumpkin cookies
Black and white cookies
Halloween cookies

Candidate labels: bigrams + trigrams
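The slide above names bigrams and trigrams of recipe titles as candidate labels but does not show how they are collected. A minimal sketch, assuming simple whitespace tokenization and a frequency cutoff; the function names and the `min_count` parameter are hypothetical and not taken from the deck:

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-word tuples from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_labels(titles, min_count=2):
    """Count bigrams and trigrams across titles.

    Keeps only n-grams seen at least `min_count` times (an assumed cutoff).
    """
    counts = Counter()
    for title in titles:
        tokens = title.lower().split()
        for n in (2, 3):
            counts.update(ngrams(tokens, n))
    return {" ".join(gram): c for gram, c in counts.items() if c >= min_count}


titles = [
    "Classic chocolate chip cookies",
    "Patty's best chocolate cookies",
    "Peanut butter cookies",
    "Sugar cookies with frosting",
    "Gooey butter cookies",
    "Banana pumpkin cookies",
    "Black and white cookies",
    "Halloween cookies",
]
labels = candidate_labels(titles)
```

With only the eight titles from the slide, "butter cookies" is the one n-gram that repeats; a larger set of search results would presumably yield more labels.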
Defining distance measure
Recipe 1: Ingr1, Ingr2, Ingr3, Ingr4
Recipe 2: Ingr4, Ingr9, Ingr12
Jaccard = (ingredients in both recipes) / (ingredients in either recipe)
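The ratio on this slide can be checked against the two example recipes. The sketch below treats each recipe as a set of ingredient names. The formula as written is the Jaccard index (a similarity), so the distance is taken here as one minus that value, which is an assumption about how the deck uses the term. Function names are illustrative:

```python
def jaccard_index(a, b):
    """Ingredients in both recipes divided by ingredients in either recipe."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty recipes count as having no overlap
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Distance form: 0 for identical ingredient sets, 1 for disjoint ones."""
    return 1.0 - jaccard_index(a, b)


recipe_1 = {"Ingr1", "Ingr2", "Ingr3", "Ingr4"}
recipe_2 = {"Ingr4", "Ingr9", "Ingr12"}

# One shared ingredient (Ingr4) out of six distinct ingredients overall.
index = jaccard_index(recipe_1, recipe_2)
distance = jaccard_distance(recipe_1, recipe_2)
```

For the slide's example the index is 1/6 and the distance is 5/6.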
Cluster recipes based on ingredients
Challenges of big data
• Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds
• Pre-calculate Jaccard distances between every pair of recipes (1.6 billion pairs!)
[Figure: # recipes (y-axis, 0 to 4000) vs. # ingredients in recipe (x-axis, 0 to 40)]
[Figure: # ingredients (y-axis, 0 to 900) vs. # recipes containing ingredient (x-axis, log-spaced ticks from 1 to 10000)]
• MapReduce on Amazon EMR
• Preload into networkx graph
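The deck does not show how the pairwise step is structured. One common way to express all-pairs Jaccard in map/reduce terms is to first index recipes by ingredient, then count shared ingredients for every pair that co-occurs under an ingredient. The single-machine sketch below illustrates that idea only; it is an assumption, not the author's EMR job, and `pairwise_jaccard` and `min_similarity` are hypothetical names:

```python
from collections import defaultdict
from itertools import combinations


def pairwise_jaccard(recipes, min_similarity=0.0):
    """Return (recipe_a, recipe_b, similarity) for pairs sharing an ingredient.

    `recipes` maps a recipe name to an iterable of ingredient names.
    Pairs with no shared ingredient (similarity 0) are never generated.
    """
    recipes = {name: set(ingrs) for name, ingrs in recipes.items()}

    # "Map": emit (ingredient, recipe) and group by ingredient.
    by_ingredient = defaultdict(list)
    for name, ingrs in recipes.items():
        for ingr in ingrs:
            by_ingredient[ingr].append(name)

    # "Reduce": each ingredient contributes one shared item to every
    # pair of recipes that contain it.
    shared = defaultdict(int)
    for names in by_ingredient.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1

    edges = []
    for (a, b), both in shared.items():
        either = len(recipes[a]) + len(recipes[b]) - both
        similarity = both / either
        if similarity > min_similarity:
            edges.append((a, b, similarity))
    return edges


recipes = {
    "sugar cookies": {"flour", "sugar", "butter", "egg"},
    "shortbread": {"flour", "sugar", "butter"},
    "meringue": {"sugar", "egg white"},
}
edges = pairwise_jaccard(recipes)
```

The resulting triples have the shape that networkx's `Graph.add_weighted_edges_from` accepts, so they could be preloaded into a graph once and reused at query time. Very common ingredients produce very long pair lists under this scheme, which is one reason a distributed run may be needed at the scale quoted on the slide.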
Find enriched/depleted ingredients
abs(log2 ratio) > 2
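The slide gives the threshold but not the ratio itself. A plausible reading, stated here as an assumption, is the log2 of how often an ingredient appears in a cluster relative to how often it appears in a background set of recipes. The sketch below follows that reading; the pseudocount smoothing and all names are illustrative choices, not details from the deck:

```python
import math


def log2_ratios(cluster, background, pseudocount=1.0):
    """Map each ingredient to log2(cluster frequency / background frequency).

    `cluster` and `background` are lists of ingredient sets. The pseudocount
    (an assumption) avoids taking the log of zero for absent ingredients.
    """
    def frequency(recipes, ingredient):
        hits = sum(1 for recipe in recipes if ingredient in recipe)
        return (hits + pseudocount) / (len(recipes) + pseudocount)

    ingredients = set().union(*cluster, *background)
    return {
        ingr: math.log2(frequency(cluster, ingr) / frequency(background, ingr))
        for ingr in ingredients
    }


def enriched_or_depleted(ratios, threshold=2.0):
    """Keep ingredients whose absolute log2 ratio exceeds the threshold."""
    return {ingr: r for ingr, r in ratios.items() if abs(r) > threshold}


cluster = [{"pumpkin", "flour"} for _ in range(4)]
background = [{"flour", "chocolate"} for _ in range(15)] + [{"flour", "pumpkin"}]

ratios = log2_ratios(cluster, background)
flagged = enriched_or_depleted(ratios)
```

In this toy data pumpkin comes out enriched (positive ratio), chocolate depleted (negative ratio), and flour, which is everywhere, is not flagged.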
Domain-specific data munging
• Ingredients: nltk dictionary
• Domain knowledge
• Unit parsing
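The deck lists unit parsing without detail. As an illustration only, the sketch below splits an ingredient line into quantity, unit, and name with a regular expression. The unit table, the pattern, and the function names are all assumptions, and decimals, ranges, and metric units are deliberately left out:

```python
import re
from fractions import Fraction

# Assumed, deliberately small unit table mapping spellings to one short form.
UNITS = {
    "cup": "cup", "cups": "cup",
    "tablespoon": "tbsp", "tablespoons": "tbsp", "tbsp": "tbsp",
    "teaspoon": "tsp", "teaspoons": "tsp", "tsp": "tsp",
    "ounce": "oz", "ounces": "oz", "oz": "oz",
}

# Quantity is "1 1/2", "1/2", or "2"; everything after it is unit plus name.
LINE = re.compile(r"^\s*(?P<qty>(?:\d+\s+)?\d+/\d+|\d+)\s+(?P<rest>.+?)\s*$")


def parse_ingredient(line):
    """Return (quantity, unit, name); quantity and unit may be None."""
    match = LINE.match(line)
    if not match:
        return None, None, line.strip()
    # "1 1/2" becomes Fraction(1) + Fraction(1, 2).
    quantity = sum(Fraction(part) for part in match.group("qty").split())
    words = match.group("rest").split()
    unit = UNITS.get(words[0].lower().rstrip("."))
    name = " ".join(words[1:] if unit else words)
    return quantity, unit, name


parsed = parse_ingredient("1 1/2 cups flour")
```

Normalizing units this way is what lets the same ingredient be compared across recipes that write it differently; a real pipeline would need a much larger unit table.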
Tools
Back end
• Yummly API
• Python (Pycurl, Nltk wordnet)
• MySQL
Analysis
• Numpy, Scipy
• Nltk, networkx
• Python, R
• Amazon EMR
Front end
• HTML/CSS/JavaScript
• Twitter Bootstrap
• Flask
• Amazon AWS
Diane Wu
• PhD Genetics, Stanford University, CA
• BSc Computing Science, Simon Fraser, Canada
