Neo4J - Tales from the Trenches

A RECOMMENDATION ENGINE CASE STUDY

Michal Bachman & Nicki Watt
@bachmanm & @techiewatt
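Who we are
   [Diagram: a small graph]
   Nicki Watt and Michal Bachman each "works on" Opigram
   Nicki Watt and Michal Bachman each "works for" OpenCredo (role = consultant)
   Nicki Watt is "colleague of" Michal Bachman
   Opigram "uses" Neo4J
   OpenCredo is "partner of" Neo4J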
Opigram
   http://labs.yougov.co.uk
   Opinion Profile
   Social Network (TBD)
   Recommendation Engine
   CMS
Opigram
   [Diagram]
   Panelists provide Opinions, about Things and about themselves
   These generate Recommendations / Interesting Insights:
      People who like ... also tend to like ...
      People who like ... tend to support ...
      People who like ... describe themselves as ...
Opigram
 Started Feb 2011
 Nov 2011
     OpenCredo
     Many lessons learned
   Stats
   ~ 150k panelists (a.k.a. users)
   ~ 100k things (movies, books, ...)
   ~ 8M relationships
Neo4J
   Graph Database
   Schema-less (NoSQL)
   Vertices and Edges
   a.k.a. Nodes and Relationships
   Traversals
   Version 1.7 just released!
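Neo4J
   [Diagram: example graph]
   Nicki Watt and Michal Bachman each "works on" Opigram
   Nicki Watt and Michal Bachman each "works for" OpenCredo (role = consultant)
   Nicki Watt is "colleague of" Michal Bachman
   Opigram "uses" Neo4J
   OpenCredo is "partner of" Neo4J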
Opigram + Neo4J
   Taxonomy of things
   Opinions on things
   Recommendations
   Offline Crunching
Opigram + MySQL
 CMS Functionality
 Crunching Results
 Configuration / Metadata
Lessons Learned
 Everyone loves Neo4J! Find praise online
 Trenches Talk: aiming to provide insight into some real problems encountered and approaches to solutions
 We have 5 practical lessons for you
   Tips
   Tricks
   Troubles
Lessons Learned
   Lesson 1: Graph Schema
   Lesson 2: Neo Node IDs
   Lesson 3: Graph-wide Operations
   Lesson 4: Extracting Randomised Data
   Lesson 5: Multi-threading
Lesson 1

Graph Schema
Schema-less
   Image credit: Greencolander
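Graph schema, first version (diagram)
   Michal -[review: text = ..., descriptors = Cool, Funny]-> Pulp Fiction
   Pulp Fiction -[type]-> Movie
   Pulp Fiction -[described as: votes = 1]-> Cool
   Pulp Fiction -[described as: votes = 1]-> Funny
   Cool, Funny, Boring, Romantic -[type]-> Descriptor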
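Graph schema, revised version (diagram)
   Michal -[created]-> Review (text = ...)
   Review -[review of]-> Pulp Fiction
   Review -[described as]-> Cool
   Review -[described as]-> Funny
   Pulp Fiction -[type]-> Movie
   Cool, Funny, Boring, Romantic -[type]-> Descriptor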
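
The revised schema can be approximated in a few lines of plain Python. This is only a sketch of the idea, not Neo4J code: the `reviews` list, the second review by Nicki and both helper functions are hypothetical. It shows why treating each review as its own record removes the need for a stored `votes` counter, and why "all people that used a descriptor" becomes a simple query.

```python
from collections import Counter

# Hypothetical in-memory stand-in for the revised schema: each review is its
# own record linking a reviewer, the reviewed things and the descriptors used.
reviews = [
    {"by": "Michal", "of": ["Pulp Fiction"], "descriptors": ["Cool", "Funny"]},
    {"by": "Nicki", "of": ["Pulp Fiction"], "descriptors": ["Cool"]},
]


def top_descriptors(thing, limit=5):
    """Derive vote counts on demand instead of storing 'votes' on an edge."""
    votes = Counter(
        descriptor
        for review in reviews
        if thing in review["of"]
        for descriptor in review["descriptors"]
    )
    return votes.most_common(limit)


def people_who_used(descriptor):
    """Everyone who used a descriptor in any review."""
    return sorted({r["by"] for r in reviews if descriptor in r["descriptors"]})


print(top_descriptors("Pulp Fiction"))
print(people_who_used("Cool"))
```

Because `of` is a list, one review can also cover several things at once, which the first schema could not express.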
Lesson 2

Neo4J Node IDs
Neo Node IDs
 What are they?
 Can I use them to represent my keys?
   No!
 Why not?
   Not stable
   IDs are garbage collected over time, so they are only guaranteed to be unique during a specific time span
Example

User Transformation
   MySQL table:
      USER_ID   NEO_NODE_ID   ACTIVE
      101       1             Y
      102       2             Y
      103       3             N, then Y
   Graph (node IDs in brackets):
      Michal [1], Nicki [2], Jim [3] -[type]-> Panelist [4]
      Cool, Boring [5], Funny [7], Romantic [8] -[type]-> Descriptor [6]
   Jim is now Cool!
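
The failure mode in the Jim example can be reproduced with a toy store. This is a minimal sketch under a stated assumption: `RecyclingStore` is a hypothetical class that hands freed IDs back out, which illustrates ID recycling in general and is not Neo4J's actual allocator.

```python
class RecyclingStore:
    """Toy node store that reuses freed IDs (illustrative assumption only)."""

    def __init__(self):
        self.nodes = {}
        self.free_ids = []
        self.next_id = 1

    def create(self, name):
        if self.free_ids:
            node_id = self.free_ids.pop()   # recycle a previously freed ID
        else:
            node_id = self.next_id
            self.next_id += 1
        self.nodes[node_id] = name
        return node_id

    def delete(self, node_id):
        del self.nodes[node_id]
        self.free_ids.append(node_id)


store = RecyclingStore()

# External table (the MySQL side of the slide) that stores internal node IDs.
user_to_node = {
    101: store.create("Michal"),
    102: store.create("Nicki"),
    103: store.create("Jim"),
}

store.delete(user_to_node[103])   # Jim's node is removed from the graph
cool_id = store.create("Cool")    # a new descriptor node gets the freed ID

# The stale external reference now resolves to the wrong node.
print(store.nodes[user_to_node[103]])
```

The external table never changed, yet user 103 now points at the "Cool" descriptor.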
Alternate ID Strategies
 Client provided IDs
   Add as a standard property on the node
   Add to index (or use auto indexer)
 Natural vs. Synthetic IDs
 Auto generate your own IDs
   Hook into Neo4J Transaction Kernel
   Use auto indexer
Auto generate your own IDs
1)   Implement TransactionEventHandler
2)   Register TransactionEventHandler with graphDatabaseService
3)   Turn auto indexing on for seamless generation
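
The three steps can be sketched in plain Python to show how they fit together. Everything here is hypothetical: `TinyGraph`, `register_handler`, `commit` and `assign_uid` are stand-ins invented for illustration, not the Neo4J API. The point is the shape of the approach: a hook that runs before commit stamps each new node with a key you control, and an index on that key makes lookups seamless.

```python
import itertools


class TinyGraph:
    """Hypothetical stand-in for a graph database with a before-commit hook."""

    def __init__(self):
        self.nodes = []      # committed nodes, each a dict of properties
        self.index = {}      # stand-in for an auto index on the 'uid' property
        self.handlers = []

    def register_handler(self, handler):
        # Step 2: register the handler with the database.
        self.handlers.append(handler)

    def commit(self, created_nodes):
        for handler in self.handlers:
            handler(created_nodes)        # runs before the data is stored
        for node in created_nodes:
            self.nodes.append(node)
            if "uid" in node:             # Step 3: index every node by its uid
                self.index[node["uid"]] = node


_counter = itertools.count(1)


def assign_uid(created_nodes):
    # Step 1: the handler gives every new node a key that we control.
    for node in created_nodes:
        if "uid" not in node:
            node["uid"] = next(_counter)


graph = TinyGraph()
graph.register_handler(assign_uid)
graph.commit([{"name": "Michal"}, {"name": "Nicki"}])
print(graph.index[2]["name"])
```

A process-local counter like `_counter` is only adequate for a single instance; in a clustered setup the generator would have to be unique across all instances.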
Lesson 2: Conclusion
Don't use Neo Node IDs as your keys!!!
It's a losing battle; ultimately the force
will not be with you!

            credit: http://uk.xbox.gamespy.com
Lesson 3

Graph-wide Operations
Motivations
 Fixes
     Bugs
     Re-indexing
   Schema Migrations
   Data Export
   Data Analysis
   Count Caching
Lesson 3: Graph-wide Operations
   Batch Updates
   Delete relationships only from one side
   GlobalGraphOperations since 1.6
   No need for TX when reading
Example

Deleting soft-deleted relationships
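
A rough sketch of this kind of clean-up, in plain Python rather than the Neo4J API. The data layout (`outgoing`, `rels`), the `deleted` flag and the function name are assumptions made for illustration. It reads "delete only from one side" as: walk each node's outgoing relationships only, so every relationship is visited once, then delete in fixed-size batches with one commit per batch.

```python
def delete_soft_deleted(outgoing, rels, batch_size=2):
    """Remove relationships flagged as deleted; return the number of commits.

    outgoing: {node: [relationship ids that start at that node]}
    rels:     {relationship id: {"start": ..., "end": ..., "deleted": bool}}
    """
    # Read pass: outgoing side only, so each relationship is seen exactly once
    # instead of once per endpoint.
    doomed = [
        rel_id
        for rel_ids in outgoing.values()
        for rel_id in rel_ids
        if rels[rel_id]["deleted"]
    ]

    commits = 0
    # Write pass: fixed-size batches, one "transaction" per batch.
    for start in range(0, len(doomed), batch_size):
        for rel_id in doomed[start:start + batch_size]:
            outgoing[rels[rel_id]["start"]].remove(rel_id)
            del rels[rel_id]
        commits += 1
    return commits


rels = {
    1: {"start": "Michal", "end": "Pulp Fiction", "deleted": False},
    2: {"start": "Michal", "end": "Cool", "deleted": True},
    3: {"start": "Nicki", "end": "Cool", "deleted": True},
    4: {"start": "Nicki", "end": "Funny", "deleted": True},
}
outgoing = {"Michal": [1, 2], "Nicki": [3, 4]}

print(delete_soft_deleted(outgoing, rels))
```

Batching keeps each transaction small, so a graph-wide fix does not have to hold every change in memory at once.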
Example

Computing statistics
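
As a small illustration of a graph-wide statistic, the sketch below makes a single read-only pass over every relationship and counts how often each thing is liked. It is plain Python with made-up data: the tuple layout, the `LIKES` type and the function name are all hypothetical. The resulting counts are the kind of value that could then be cached on the nodes.

```python
from collections import Counter


def like_counts(relationships):
    """Single read-only pass over (start, type, end) relationship tuples."""
    counts = Counter()
    for _start, rel_type, end in relationships:
        if rel_type == "LIKES":
            counts[end] += 1
    return counts


relationships = [
    ("Michal", "LIKES", "Pulp Fiction"),
    ("Nicki", "LIKES", "Pulp Fiction"),
    ("Nicki", "LIKES", "Neo4J"),
    ("Michal", "WORKS_ON", "Opigram"),
]

print(like_counts(relationships))
```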
Lesson 4

Extracting Randomised Data
Extracting Randomised Data
 Use Cases
   Provide random suggestions to users
   Use for statistical analysis, a.k.a. random sampling
 Problem
   No built-in Neo4J support
   Not Neo4J's sweet spot
   May result in very bad performance
Options
 Randomisation Strategies
   Load, Shuffle, Pick
   Hit and Miss
   Custom Relationship Expander/Evaluator
   Reservoir Sampling
 Performance Helpers
   Indexes
   Front with a cache if need be
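
The first two strategies are simple enough to sketch in plain Python. The function names and parameters below are hypothetical, and the node IDs are just integers in a set. "Load, Shuffle, Pick" suits a known, small subset; "Hit and Miss" guesses random IDs across the whole ID range and skips the ones that do not exist.

```python
import random


def load_shuffle_pick(nodes, k, rng=random):
    """Load a known, small subset fully, shuffle it, take the first k."""
    loaded = list(nodes)
    rng.shuffle(loaded)
    return loaded[:k]


def hit_and_miss(existing_ids, max_id, k, rng=random, max_attempts=10_000):
    """Guess random IDs in [0, max_id] and keep the ones that exist."""
    picked = set()
    attempts = 0
    while len(picked) < k and attempts < max_attempts:
        candidate = rng.randint(0, max_id)   # inclusive on both ends
        attempts += 1
        if candidate in existing_ids:        # a miss is simply skipped
            picked.add(candidate)
    return picked


rng = random.Random(1)
print(load_shuffle_pick(range(10), 3, rng))
print(hit_and_miss({2, 5, 8, 11}, 12, 3, rng))
```

Hit and Miss draws from the whole population, so it is a poor fit when only a subset of the graph should be sampled, and it degrades as the ID space gets sparse. The `max_attempts` cap is there so a sparse ID space cannot loop forever.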
Custom Relationship Evaluator
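
The idea of randomly discarding relationships while traversing can be sketched as follows. This is not the Neo4J expander or evaluator API: the adjacency-dict graph, the function name and the `keep_probability` parameter are assumptions for illustration.

```python
import random


def traverse_with_random_discard(graph, start, keep_probability, rng=random):
    """Breadth-first traversal that randomly drops relationships.

    graph maps a node to the list of nodes it points to.
    """
    visited = [start]
    seen = {start}
    frontier = [start]
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbour in graph.get(node, []):
                if neighbour in seen:
                    continue
                if rng.random() >= keep_probability:
                    continue              # discard this relationship
                seen.add(neighbour)
                visited.append(neighbour)
                next_frontier.append(neighbour)
        frontier = next_frontier
    return visited


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
print(traverse_with_random_discard(graph, "a", 0.5, random.Random(3)))
```

Nodes close to the start are more likely to survive than nodes deep in the graph, so the result is random-ish rather than uniformly random.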
Reservoir Sampling Algorithm
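
For reference, here is a minimal sketch of reservoir sampling in plain Python, in its common "Algorithm R" form. Whether the original slide used exactly this variant is not known, and the function name is hypothetical. It picks `k` items uniformly from an iterable of unknown length in one pass, which is why it works well with iterables that cannot be loaded fully into memory.

```python
import random


def reservoir_sample(iterable, k, rng=random):
    """Return k items chosen uniformly at random from a single pass."""
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)         # inclusive on both ends
            if j < k:
                reservoir[j] = item       # replace with probability k / (i + 1)
    return reservoir


sample = reservoir_sample(iter(range(1000)), 25, random.Random(7))
print(len(sample))
```

If the input holds fewer than `k` items, all of them are returned.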
Traversals vs. Index
   [Chart: 25 random nodes extracted from [Sample Size] using Reservoir Sampling algorithm]
   X-Axis: Sample Size (5000, 10000, 20000, 40000, 80000, 160000)
   Y-Axis: Time (milliseconds), 0 to 45000
   Series: 1.4.2 traversal pass 1 (cold), 1.4.2 traversal pass 2 (warmish), 1.5 traversal pass 1 (cold), 1.5 traversal pass 2 (warmish), 1.5 index, 1.6.2 traversal pass 1 (cold), 1.6.2 traversal pass 2 (warmish)
   Use of Lucene indexes can reduce time to roughly 300 - 1000 ms from cold
Conclusion
 Most options are not truly random, more random-ish
 Performance is bad primarily when hitting cold parts of the graph
 Caching helps
   If it is an option, serve stale data until the next random sample can be selected
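
One way to serve stale data until the next sample is ready is sketched below in plain Python. `StaleSampleCache` and its methods are hypothetical names. Readers always get the last computed sample immediately, and a separate job calls `refresh()` when `is_stale()` reports that the sample is old.

```python
import time


class StaleSampleCache:
    """Serve the last random sample until a fresh one has been computed."""

    def __init__(self, compute_sample, max_age_seconds, clock=time.monotonic):
        self._compute = compute_sample
        self._max_age = max_age_seconds
        self._clock = clock
        self._sample = None
        self._computed_at = None

    def is_stale(self):
        return (self._computed_at is None
                or self._clock() - self._computed_at > self._max_age)

    def get(self):
        """Return the cached sample, even if stale; compute only if empty."""
        if self._sample is None:
            self.refresh()
        return self._sample

    def refresh(self):
        """Intended to be called from a background job when is_stale() is true."""
        self._sample = self._compute()
        self._computed_at = self._clock()


cache = StaleSampleCache(lambda: ["Pulp Fiction", "Neo4J"], max_age_seconds=60)
print(cache.get())
```

Only the very first reader pays for the expensive sample; later readers never wait on a cold part of the graph.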
Lesson 5

Multi-threading
Lesson 5: Multi-threading
 Shortcoming in Neo4J
 Fixed in version 1.7
 Avoid relationship properties in multi-threaded pre-1.7 apps
Questions?
Beer Time!
 @bachmanm
 michal.bachman@opencredo.com

 @techiewatt
 nicki.watt@opencredo.com

Editor's Notes

  • #2: TODO Neo logo
  • #4: Nicki: A complete online profile of your interests, tastes and opinions. Designed to be useful to you and to the rest of the world. http://labs.yougov.co.uk
  • #10: Michal
  • #11: Example: find all the companies that work on Opigram
  • #14: Nicki
  • #15: Don't spend too much time on this:
      - First 2: general and applicable to all
      - Next 2: specific tips, there is a chance you'll need them
      - Last one: performance
  • #18: Describe the problem and how it evolved.
    Can express:
      - User's descriptors for a given review
      - Top 5 descriptors for a given thing
      - All things that are cool
    Problems with the resulting schema:
      - Change in a review => need to update votes
      - Need to make sure "described as" is deleted when votes = 0
      - Finding all people that described something as cool is too complicated
      - Not future-proof: what if we now want to review 2 things together (like Nicki and I)?
  • #19: No need to keep track of votes.
    Can still do all the traversals I need:
      - User's descriptors for a given review
      - Top 5 descriptors for a given thing
      - All things that are cool
    PLUS:
      - All people that used a descriptor
      - Can review multiple things now
  • #21: As Michal explained, a node is ...
    A Neo Node ID is:
      - A Neo4j-generated unique ID
      - A long (generally auto-incrementing, like MySQL auto-incrementing primary keys or Oracle sequences)
      - Easily accessible and exposed via Neo4J APIs
    You may hear that and think: great, I need a unique identifier, this sounds like it does what I need, I shall just use that rather than manage it myself.
    ENTER LESSON 2: Don't use Neo Node IDs as your primary keys
  • #25: Benefits of this approach:
      - No code for you to worry about
      - If you have multiple clients writing to the database (legacy system), this will be taken care of for you under the covers
    generateUniqueID() needs to be unique across HA
  • #26: Different versions handle IDs differently:
      - 1.4.2: mostly recycling of old IDs
      - 1.5+: possible changing of IDs between server restarts
    TODO: Don't expose! + Index is your friend
  • #41: Problem:
      - Trying to pick a random number of nodes out of the graph
      - Not Neo4J's sweet spot
      - Especially hard when dealing with sub graphs
    Examples:
      - Pick some random nodes out of the graph to display to people to ask for recommendations
      - Use as part of statistical algorithms to make statements like "People who tend to like ... tend to also ..."
    Solutions:
      - If size small enough and known traversal path: load into Collections and shuffle
      - If size large: Custom Relationship Expander
      - If the whole graph is in play: Scattergun, or Indexer with Reservoir Sampling algorithm
    Lessons Learned:
      - Random-access type work is not Neo4J's sweet spot
      - Can get around it with indexes and random(ish) selection algorithms, but may not be ideal
  • #43: How random does random need to be?
      - Load, Shuffle, Pick: if hitting a known, small subset of the graph, load all nodes, then Collections.shuffle(..)
      - Hit and Miss: all nodes form part of the population, so not good when you want subsets of the graph; generate random IDs, deal with cases of misses
      - Custom Relationship Expander/Evaluator: randomly discard relationships as you go along; Iterables returned by the traverser are generally not random, gives more precedence to nodes earlier on
      - Reservoir Sampling: designed for use with Iterables; randomly build up and replace the ultimate subset to return
      - Use an index
      - Front with a cache if need be
  • #44, #45: How random does random need to be?
      - Load, Shuffle, Pick: if hitting a known, small subset of the graph, load all nodes, then Collections.shuffle(..)
      - Scattergun: all nodes form part of the population; generate random IDs, deal with cases of misses
      - Custom Relationship Expander: randomly discard relationships as you go along
      - Reservoir Sampling: great for use with Iterables
      - Use an index
      - Front with a cache if need be
  • #46: Mac OS 10.7, 8GB RAM (left over: +-4.5GB)
    JVM heap max: 1.5GB
    Neo4J mapped memory settings (2.0GB):
      neostore.nodestore.db.mapped_memory=256M
      neostore.relationshipstore.db.mapped_memory=768M
      neostore.propertystore.db.mapped_memory=512M
      neostore.propertystore.db.strings.mapped_memory=256M
      neostore.propertystore.db.arrays.mapped_memory=256M
    Post upgrade from 1.4.2 -> 1.5 -> 1.6.2:
      2.2M     neostore.nodestore.db
      395.0M   neostore.propertystore.db
      2.8M     neostore.propertystore.db.strings
      581.0M   neostore.relationshipstore.db
      7.6M     neostore.propertystore.db.arrays
    Pre upgrade from 1.4.2 -> 1.5 -> 1.6.2:
      2.2M     neostore.nodestore.db
      1000.0M  neostore.propertystore.db
      54.0M    neostore.propertystore.db.arrays
      8000.0M  neostore.propertystore.db.strings
      581.0M   neostore.relationshipstore.db
  • #47: TODO: Mention disk access
    How random does random need to be?
      - Load, Shuffle, Pick: if hitting a known, small subset of the graph, load all nodes, then Collections.shuffle(..)
      - Scattergun: all nodes form part of the population; generate random IDs, deal with cases of misses
      - Custom Relationship Expander: randomly discard relationships as you go along
      - Reservoir Sampling: great for use with Iterables
      - Use an index
      - Front with a cache if need be