This is the presentation that Raffael Strassnig and I did at Strata London in May 2017.
Its all about a new graph-based similarity metric, the Expected Degrees of Separation, and how to estimate it using Apache Spark. Its based on work that we did in the Advanced Data Analytics team at Barclays in London.
2. Customer values and recommendation engines
Understanding customer values is a similar problem to recommendation
How is Tesco similar to Asda?
How is Tesco in Bristol similar to Asda in Bath?
How is Tesco in Bristol similar to Bristol Angling Centre?
Assume that the similarity of two businesses can be inferred by the extent
to which they share customers
3. The data: Customer transactions
Timestam
p
Customer Business Amount (贈)
Bob Smith Tesco, Bristol
Mary
Jones
Tesco, Bristol
Bob Smith Asda, Bath
John
Taylor
Bristol Angling Centre
4. Transactional data can be seen as a bipartite
graph
Tesco
Asda BP
Boots
Timestam
p
CustomerI
D
MerchantName Amount (贈)
1 Tesco
1 Asda
2 Boots
2 BP
3 Tesco
3 Boots
3 BP
4 Asda
5. Transactions imply customer preferences over
business values
Tesco
Price
Quality
Boots
Health
? Price ?
? Quality ?
Asda
Price
Customer
Business
6. Problems with conventional similarity metrics
Conventional recommenders (say Cosine similarity) are useless in
sparsely connected networks
基 = = (1,1,1,0,1,1,0,0)
= = 1,0,1,0,1,1,0,1
= cos
Customer Tesco Asda
Bob Smith Yes Yes
Mary Jones No Yes
John Taylor Yes Yes
Jane
Williams
No No
Gary Brown Yes Yes
Liz Davis Yes Yes
David Evans No No
Helen
Wilson
Yes No
7. Problems with conventional graph metrics
Conventional graph metrics (say minimum distance) are useless in
networks with significant hubs.
16. Comparing PageRank and Asymptotic EDS
PageRank: stationary distribution non-absorbing (eigenvector)
EDS: we would like to know how far we need to go (on average) to hit B
starting from A
Comment on ergodicity and teleporting
Teleporting solely mechanism to get around non-ergodic graphs
We assume a connected graph (eliminate graphs that only have
customers that solely shop with them) natural for retail not for websites.
17. Absorbing transition matrix
Discrete phase type distribution
Computes EDS for all sources to single destination
Gives exact results
=
0 1
, 乞倹 = 1
18. Spark implementation of Absorbing Transition
Matrix
[Spark code]
Has unacceptably high complexity ((4)) due to inverting large matrix
19. Estimating EDS using path sampling
An alternative is a sampling approach which is fully distributable
Complexity = ( )
Converges to analytical solution
Cautions
Shorter path length can have high variance ( < 4)
Signal dilution (applies to exact solution too)
23. Results vs Pagerank
Pagerank EDS
SEVERN RIVER CROSSIN
PLC
MCDONALDS
MCDONALDS PRIMARK STORES LDT
POST OFFICE COUNTER LIDL
LIDL BOOTS
PRIMARK STORES ALDI
ALDI MARKS&SPENCER PLC
MARTIN MCCOLL POST OFFICE COUNTER
MARKS&SPENCER PLC GREGGS PLC
24. Results
Results arent much use because of low differentiation between shops
Once you have escaped vicinity of a shop reverts to random walk.
Certain nodes connect everything (MCDONALDS, IKEA)
Theoretically interesting metric but not insightful
28. Brute force GPU implementation of Localised
EDS
Brute force matrix multiplication using GPUs
Parallelisation can be achieved by destination node
Sort-of scales (( . 3
) ) as long as matrix fits into memory
Can be faster than Spark
29. Compare results from EDS and Localised EDS
STARBUCKS BS16
Starbucks BS 35 1.51 McDonalds 54.0
KFC 2.04 Primark Stores 90.9
Krispy Creme 3.11 Lidl 105.9
The Old Fish Market Pub 4.8 Boots 120.2
IKEA 6.74 Marks & Spencer 126.3
30. Summary
We used a probabilistic graph similarity metric to derive a richer
characterisation of customer behaviour
We implemented a number of approaches to estimate it
All had complexity/scalability challenges
Use it yourselves and share what you find!
Further work
Derive the statistical properties of the EDS
Explore methods for approximation of matrix multiplication
#2: Introduction to Barclays Advanced Data Analytics (ADA)
ADA is a data science team which innovates, designs and builds applications that deliver, direct to customers, relevant analytical content that will help them make smart decisions to improve their lives.
We are aim to make applications that will revolutionise the way Barclays relates to our customers. The long term vision is to give each of our customers the same level of engagement and support in planning their finances and their lives as they would have if they were billionaires.