Lessons learned from over a year with Neo4j on a social network / recommendation engine. Presented at Neo4j user group in London, UK in 2012.
Neo4j - Tales from the Trenches
1. Neo4J Tales from the Trenches
A RECOMMENDATION ENGINE
CASE STUDY
Michal Bachman & Nicki Watt
@bachmanm & @techiewatt
2. Who we are
Nicki Watt (role = consultant): works on Opigram, works for OpenCredo, colleague of Michal Bachman
Michal Bachman (role = consultant): works on Opigram, works for OpenCredo
Opigram: uses Neo4J
OpenCredo: partner of Neo4J
4. Opigram
Recommendations/
Interesting Insights Things
generates
about
People who like
also tend to like
Opinions
People who like
tend to support
About
People who like (themselves) provides
describe themselves as
.
Panelists
8. Opigram
Started Feb 2011
Nov 2011
OpenCredo
Many lessons learned
Stats
~ 150k panelists (a.k.a. users)
~ 100k things (movies, books,)
~ 8M relationships
9. Neo4J
Graph Database
Schema-less (NoSQL)
Vertices and Edges
a.k.a. Nodes and Relationships
Traversals
Version 1.7 just released!
10. Neo4J
Nicki Watt (role = consultant): works on Opigram, works for OpenCredo, colleague of Michal Bachman
Michal Bachman (role = consultant): works on Opigram, works for OpenCredo
Opigram: uses Neo4J
OpenCredo: partner of Neo4J
11. Opigram + Neo4J
Taxonomy of things
Opinions on things
Recommendations
Offline Crunching
13. Lessons Learned
Everyone loves Neo4J! Find praise online
Trenches Talk - Aiming to provide
insight into some real problems
encountered and approaches to solutions
We have 5 practical lessons for you
Tips
Tricks
Troubles
17. Movie
Pulp Fiction: type Movie
Michal: review (text = , descriptors = Cool, Funny) of Pulp Fiction
Pulp Fiction: described as Cool (votes = 1), described as Funny (votes = 1)
Cool, Boring, Funny, Romantic: type Descriptor
18. Movie
Pulp Fiction: type Movie
Michal: created Review (text = )
Review: review of Pulp Fiction, described as Cool, described as Funny
Cool, Boring, Funny, Romantic: type Descriptor
20. Neo Node IDs
What are they?
Can I use them to represent my keys?
No!
Why not?
Not Stable
Ids are garbage collected over time, thus
only guaranteed to be unique during a
specific time span
22. MySQL
USER_ID  NEO_NODE_ID  ACTIVE
101      1            Y
102      2            Y
103      3            N / Y
Graph
Nodes 1 (Michal), 2 (Nicki), 3 (Jim): type Panelist (node 4)
Nodes 5 to 8: descriptors (Cool, Boring, Funny, Romantic) of type Descriptor
Jim is now Cool!
23. Alternate ID Strategies
Client provided IDs
Add as a standard property on the node
Add to index (or use auto indexer)
Natural vs. Synthetic IDs
Auto generate your own IDs
Hook into Neo4J Transaction Kernel
Use auto indexer
24. Auto generate your own IDs
1) Implement TransactionEventHandler
2) Register TransactionEventHandler with graphDatabaseService
3) Turn auto indexing on for seamless generation
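The three steps above can be illustrated with a minimal, self-contained sketch. It does not use the Neo4J API: nodes are plain maps, the `beforeCommit` method only mimics the kind of callback a transaction event handler would receive, and the in-memory `index` map stands in for the auto index. The class name, the `uid` property name and the method names are all hypothetical. Note also that an `AtomicLong` is only unique within a single JVM, so a clustered (HA) setup would need a different ID source.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Sketch only: assigns application-owned IDs to newly created nodes before commit. */
final class OwnIdAssigner {
    // Hypothetical property name under which the generated ID is stored.
    static final String ID_PROPERTY = "uid";

    private final AtomicLong sequence;
    // Stand-in for an index keyed on the generated ID (not a real Neo4J index).
    private final Map<Long, Map<String, Object>> index = new HashMap<>();

    OwnIdAssigner(long firstId) {
        this.sequence = new AtomicLong(firstId);
    }

    /** Mimics a before-commit hook: called with the nodes created in the transaction. */
    void beforeCommit(List<Map<String, Object>> createdNodes) {
        for (Map<String, Object> node : createdNodes) {
            // Leave nodes alone if they already carry an ID (e.g. client-provided).
            if (!node.containsKey(ID_PROPERTY)) {
                long id = sequence.getAndIncrement();
                node.put(ID_PROPERTY, id);
                index.put(id, node);
            }
        }
    }

    /** Looks a node up by its generated ID instead of by an internal node ID. */
    Map<String, Object> findByUid(long uid) {
        return index.get(uid);
    }
}
```

Callers then only ever store and exchange the `uid` value, never the database's internal node ID.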
25. Lesson 2: Conclusion
Don't use Neo Node IDs as your keys!!!
It's a losing battle; ultimately the force
will not be with you!
credit: http://uk.xbox.gamespy.com
41. Extracting Randomised Data
Use Cases
Provide Random Suggestions to users
Use for statistical analysis aka Random
Sampling
Problem
No built-in Neo4J support
Not Neo4J's sweet spot
May result in very bad performance
42. Options
Randomisation Strategies
Load, Shuffle, Pick
Hit and Miss
Custom Relationship Expander/Evaluator
Reservoir Sampling
Performance Helpers
Indexes
Front with a cache if need be
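The first two strategies in the list above might look roughly like the following sketch. It is deliberately independent of Neo4J: an `Iterable` stands in for the result of a traversal, and a `Map` keyed by ID stands in for whatever lookup resolves a guessed ID to a node. The class and method names, the `maxId` bound and the `maxAttempts` cut-off are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

/** Sketch only: two simple ways of picking k random items. */
final class RandomPickStrategies {

    /** Load, Shuffle, Pick: copy a small, known candidate set, shuffle it, keep the first k. */
    static <T> List<T> loadShufflePick(Iterable<T> candidates, int k, Random random) {
        List<T> loaded = new ArrayList<>();
        for (T candidate : candidates) {
            loaded.add(candidate);
        }
        Collections.shuffle(loaded, random);
        return new ArrayList<>(loaded.subList(0, Math.min(k, loaded.size())));
    }

    /** Hit and Miss: guess IDs in [0, maxId), keep the ones that resolve, skip the misses. */
    static <T> List<T> hitAndMiss(Map<Long, T> store, long maxId, int k,
                                  int maxAttempts, Random random) {
        Set<Long> alreadyHit = new HashSet<>();
        List<T> picked = new ArrayList<>();
        for (int attempt = 0; attempt < maxAttempts && picked.size() < k; attempt++) {
            long guess = (long) (random.nextDouble() * maxId);
            T value = store.get(guess);
            // A miss (no such ID) or a repeat is simply skipped.
            if (value != null && alreadyHit.add(guess)) {
                picked.add(value);
            }
        }
        return picked;
    }
}
```

The first method needs the whole candidate set in memory, so it only suits small subsets. The second treats the entire ID range as the population and can return fewer than k items when the ID space is sparse.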
45. Traversals vs. Index
[Chart: 25 random nodes extracted from [Sample Size] using Reservoir Sampling algorithm. X-Axis: Sample Size (5000 to 160000). Y-Axis: Time in milliseconds (0 to 45000). Series: 1.4.2 TRAVERSAL PASS 1 (COLD), 1.4.2 TRAVERSAL PASS 2 (WARMISH), 1.5 TRAVERSAL PASS 1 (COLD), 1.5 TRAVERSAL PASS 2 (WARMISH), 1.5 INDEX, 1.6.2 TRAVERSAL PASS 1 (COLD), 1.6.2 TRAVERSAL PASS 2 (WARMISH).]
Use of Lucene indexes can reduce time to +- 300 - 1000ms from cold
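For reference, reservoir sampling itself is short. The sketch below is the textbook "Algorithm R", written against a plain `Iterator` so that it could sit on top of any traversal or index result; it is not the code used for the measurements above, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Sketch only: reservoir sampling (Algorithm R) over a source of unknown length. */
final class ReservoirSampler {

    /** Returns up to k items, each item of the source being equally likely to be kept. */
    static <T> List<T> sample(Iterator<T> source, int k, Random random) {
        if (k < 0) {
            throw new IllegalArgumentException("k must not be negative");
        }
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0; // an int counter is enough for a sketch; use long for huge sources
        while (source.hasNext()) {
            T item = source.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill phase: the first k items always go in.
                reservoir.add(item);
            } else {
                // Replace phase: keep the new item with probability k / seen.
                int slot = random.nextInt(seen);
                if (slot < k) {
                    reservoir.set(slot, item);
                }
            }
        }
        return reservoir;
    }
}
```

The source is consumed exactly once and only k items are held in memory. Every item still has to be visited, though, which matches the observation that cold parts of the graph dominate the cost.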
46. Conclusion
Most options are not truly random
more randomish
Primarily has bad performance when
hitting cold parts of graph
Caching helps
If an option, serve stale data until next
random sample can be selected
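The last point, serving stale data until the next sample is ready, could be sketched as follows. This is an assumption about how such a cache might be built, not a description of the original system: the `Supplier` stands in for whichever slow sampling strategy is used, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Sketch only: readers get the last finished sample while a new one is computed. */
final class StaleSampleCache<T> {
    private final Supplier<List<T>> sampler; // the slow, graph-hitting part
    private final AtomicReference<List<T>> current =
            new AtomicReference<List<T>>(Collections.<T>emptyList());

    StaleSampleCache(Supplier<List<T>> sampler) {
        this.sampler = sampler;
    }

    /** Never touches the graph: returns the most recently completed sample. */
    List<T> get() {
        return current.get();
    }

    /** Computes a fresh sample and swaps it in; meant to be called off the request path. */
    void refresh() {
        List<T> fresh = Collections.unmodifiableList(new ArrayList<>(sampler.get()));
        current.set(fresh);
    }
}
```

In practice `refresh()` would typically be driven by a scheduled background task, so that user requests only ever call `get()`.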
#4: Nicki: A complete online profile of your interests, tastes and opinions. Designed to be useful to you and to the rest of the world. http://labs.yougov.co.uk
#15: Don't spend too much time on this:
- First 2: general and applicable to all
- Next 2: specific tips, there is a chance you'll need them
- Last one: performance
#18: Describe the problem and how it evolved
Can express:
- User's descriptors for a given review
- Top 5 descriptors for a given thing
- All things that are cool
Problems with the resulting schema:
- Change in a review => need to update votes
- Need to make sure "described as" is deleted when votes = 0
- Finding all people that described something as cool is too complicated
- Not future-proof: what if we now want to review 2 things together (like Nicki and I)?
#19: No need to keep track of votes
Can still do all the traversals I need:
- User's descriptors for a given review
- Top 5 descriptors for a given thing
- All things that are cool
PLUS: All people that used a descriptor
Can review multiple things now
#21: As Michal explained, a node is
Neo Node ID is:
- Neo4j generated unique ID
- long (generally auto-incrementing, like MySQL auto-incrementing primary keys or Oracle sequences)
- Easily accessible and exposed via Neo4J APIs
May hear that and think: great, I need a unique identifier, sounds like it does what I need, I shall just use that rather than manage it myself
ENTER LESSON 1: Don't Use Neo Node IDs as your primary keys
#25: Benefits of this approach:
- No code for you to worry about
- If you have multiple clients writing to the database (legacy system), this will be taken care of for you under the covers
generateUniqueID() needs to be unique across HA
#26: Different versions handle differently:
- 1.4.2: Mostly recycling of old IDs
- 1.5+: Possible changing of IDs between server restarts
TODO: Don't expose! + Index is your friend
#41: Problem:
- Trying to pick a random number of nodes out of the graph
- Not Neo4J's sweet spot
- Especially hard when dealing with sub graphs
Examples:
- Pick some random nodes out of the graph to display to people to ask for recommendations
- Use as part of statistical algorithms to make statements like "People who tend to like ... tend to also ..."
Solutions:
- If size small enough and known traversal path: Load into Collections and shuffle
- If size large: Custom Relationship Expander
- If the whole graph is in play: Scattergun, or Indexer with Reservoir Sampling algorithm
Lessons Learned:
- Random Access type work is not Neo4J's sweet spot
- Can get around it with indexes and random(ish) selection algorithms but may not be ideal
#43: How Random does Random need to be?
Load, Shuffle, Pick:
- If hitting a known, small subset of the graph
- Load all nodes then Collections.shuffle(..)
Hit and Miss:
- All nodes form part of population, not good when you want subsets of the graph
- Generate random IDs, deal with cases of misses
Custom Relationship Expander/Evaluator:
- Randomly discard relationships as you go along
- Iterables returned by traverser are generally not random, gives more precedence to nodes earlier on
Reservoir Sampling:
- Designed for use with Iterables
- Randomly build up and replace ultimate subset to return
Use an index
Front with a cache if need be
#44: How Random does Random need to be?
Load, Shuffle, Pick:
- If hitting a known, small subset of the graph
- Load all nodes then Collections.shuffle(..)
Scattergun:
- All nodes form part of population
- Generate random IDs, deal with cases of misses
Custom Relationship Expander:
- Randomly discard relationships as you go along
Reservoir Sampling:
- Great for use with Iterables
Use an index
Front with a cache if need be
#45: Same notes as #44.
#46: Mac OS 10.7, 8GB RAM, LeftOver +-4.5GB
JVM Heap max 1.5GB
Neo4J Mapped Memory Settings 2.0GB:
  neostore.nodestore.db.mapped_memory=256M
  neostore.relationshipstore.db.mapped_memory=768M
  neostore.propertystore.db.mapped_memory=512M
  neostore.propertystore.db.strings.mapped_memory=256M
  neostore.propertystore.db.arrays.mapped_memory=256M
Post Upgrade from 1.4.2 -> 1.5 -> 1.6.2...
  2.2M neostore.nodestore.db
  395.0M neostore.propertystore.db
  2.8M neostore.propertystore.db.strings
  581.0M neostore.relationshipstore.db
  7.6M neostore.propertystore.db.arrays
Pre Upgrade from 1.4.2 -> 1.5 -> 1.6.2...
  2.2M neostore.nodestore.db
  1000.0M neostore.propertystore.db
  54.0M neostore.propertystore.db.arrays
  8000.0M neostore.propertystore.db.strings
  581.0M neostore.relationshipstore.db
#47: TODO: Mention disk access. Otherwise same notes as #44.