The document discusses processing linked data at high speeds using the Signal/Collect graph algorithm framework. It provides examples of how Signal/Collect can be used to perform tasks like RDFS subclass inference and PageRank calculation on semantic graphs. It also summarizes performance results showing that TripleRush, an implementation of Signal/Collect, outperforms other graph processing systems on benchmark datasets. Finally, it discusses ongoing work on graph partitioning with TripleRush.
1 of 74
Download to read offline
More Related Content
Thu bernstein key_warp_speed
2. Processing Linked Data at
Warp Speed
Abraham Bernstein
CCBY NASA http://www.鍖ickr.com/photos/nasa_jsc_photo/sets/72157629726792248/with/7197236116/0
4. "Earth's Location in the Universe (JPEG)" by Andrew Z. Colvin - Own work. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons
http://commons.wikimedia.org/wiki/File:Earth%27s_Location_in_the_Universe_(JPEG).jpg#mediaviewer/File:Earth%27s_Location_in_the_Universe_(JPEG).jpg
7. "IBM Electronic Data Processing Machine - GPN-2000-001881"
NASA, Public Domain @ Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
8. Processing Graphs
"IBM Electronic Data Processing Machine - GPN-2000-001881"
NASA, Public Domain @ Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
11. Signal/Collect
P. Stutz, A. Bernstein, and W. Cohen. Signal/Collect: Graph algorithms for the (semantic) web.
International Semantic Web ConferenceISWC 2010. Springer Berlin Heidelberg, 2010. 764-780.
12. Vertices as stateful
processing units
Vertices interact through
signals along edges
Which are collected by a
processing function that
updates the vertex state
Processing Graphs Naturally
13. De鍖ne the graph structure
Vertices represent
RDFS classes
Edges from superclasses
to subclasses
Vertex state initialized
with the class that the
vertex represents
Signal/Collect: An Intuition
for RDFS subclass inference
id: animal
state: {animal}
id: bird
state: {bird}
id: owl
state: {owl}
id: penguin
state: {penguin}
id: animal
id: bird
id: owlid: penguin
Stutz et al, 2010
17. PageRank in Code
class Document(id: Any) extends Vertex(id, 0.15) {
def collect = 0.15 + 0.85 * signals[Double].foldLeft(0.0)(_ + _)
}
Algorithm
class Citation(citer: Any, cited: Any) extends Edge(citer, cited) {
def signal = source.state.asInstanceOf[Double] * weight / source.sumOfOutWeights
}
ExecutionInitialization
object Algorithm {
def executeCitationRank(db: SparqlAccessor) {
val computeGraph = new AsynchronousComputeGraph()
val citations = new SparqlTuples(db, "select ?source ?target where {"
+ "?source <http://lsdis.cs.uga.edu/projects/semdis/opus#cites> ?target}")
citations foreach {
case (citer, cited) =>
computeGraph.addVertex[Document](citer)
computeGraph.addVertex[Document](cited)
computeGraph.addEdge[Citation](citer, cited)
}
computeGraph.execute()
}
}
18. Signal/Collect as a Platform
P. Stutz, A. Bernstein, and W. Cohen. Signal/Collect: Graph algorithms for the (semantic) web.
International Semantic Web ConferenceISWC 2010. Springer Berlin Heidelberg, 2010. 764-780.
T R I P L E R U S H F R A U D D E T E C T I O N
D C O P S
Images::BoInsogna,https://鍖ic.kr/p/9famxT,CC-NC-NDhttps://creativecommons.org/licenses/by-nc-nd/2.0/
Antana,https://鍖ic.kr/p/gGQPhA,CC-BY-SA,https://creativecommons.org/licenses/by-sa/2.0/,
JeffKubina,https://鍖ic.kr/p/2CG5PU,CC-BY-SA
19. Tr i p l e S t o re
E LV I S D Y L A N
J O B S
? X I N S P I R E D ? Y
? Y I N S P I R E D ? Z
I N S P I R E D
DATA QUERY
IN
SP
IR
E
D
22. DylanElvis inspired
Dylan
* inspired
** inspired
*Dylan inspired Jobs
* inspired
JobsDylan inspired
Query Vertex
?X inspired ?Y!
?Y inspired ?Z
?X inspired ?Y!
?Y inspired ?Z
Elvis inspired Dylan!
Dylan inspired ?Z
Dylan inspired Jobs!
Jobs inspired ?Z
No vertex with ID!
[ Jobs inspired * ]
Query Vertex
{ ?X = Elvis, ?Y = Dylan, ?Z = Jobs }
Elvis inspired Dylan!
Dylan inspired Jobs
?X inspired ?Y!
?Y inspired ?Z
23. DylanElvis inspired
Dylan
* inspired
** inspired
*Dylan inspired Jobs
* inspired
JobsDylan inspired
Query Vertex
?X inspired ?Y!
?Y inspired ?Z
?X inspired ?Y!
?Y inspired ?Z
Elvis inspired Dylan!
Dylan inspired ?Z
Dylan inspired Jobs!
Jobs inspired ?Z
No vertex with ID!
[ Jobs inspired * ]
Query Vertex
{ ?X = Elvis, ?Y = Dylan, ?Z = Jobs }
Elvis inspired Dylan!
Dylan inspired Jobs
?X inspired ?Y!
?Y inspired ?Z
24. P e r f o r m a n c e R e s u l t s
Distributed (8 nodes), LUBM 10240 (~1.36 billion triples)
Single-node, LUBM 160 (~21 million triples)
Fastest L1 L2 L3 L4 L5 L6 L7 Geo.
of 10 runs mean
TripleRush 3,111.2 1,457.9 0.7 3.5 9.5 29.1 1,165.8 62.1
Trinity.RDF 12,648.0 6,018.0 8,735.0 5.0 4.0 9.0 31,214.0 450.0
TriAD 7,631.0 1,663.0 4,290.0 2.1 0.5 69.0 14,895.0 249.0
TriAD-SG 2,146.0 2,025.0 1,647.0 1.3 0.7 1.4 16,863.0 106.0
Fastest L1 L2 L3 L4 L5 L6 L7 Geo
of 10 runs mean
TripleRush 22.6 27.8 0.4 1.0 0.4 0.9 21.2 2.94
Trinity.RDF 281.0 132.0 110.0 5.0 4.0 9.0 630.0 46.0
TriAD 427.0 117.0 210.0 2.0 0.5 19.0 693.0 39.0
TriAD-SG 97.0 140.0 31.0 1.0 0.2 1.8 711.0 14.0
25. O n g o i n g Wo r k :
G r a p h P a r t i t i o n i n g
https://github.com/uzh/triplerush
26. T R I P L E R U S H F R A U D D E T E C T I O N
D C O P S
Signal/Collect as a Platform
Images::BoInsogna,https://鍖ic.kr/p/9famxT,CC-NC-NDhttps://creativecommons.org/licenses/by-nc-nd/2.0/
Antana,https://鍖ic.kr/p/gGQPhA,CC-BY-SA,https://creativecommons.org/licenses/by-sa/2.0/,
JeffKubina,https://鍖ic.kr/p/2CG5PU,CC-BY-SA
27. D e c o m p o s i n g F r a u d
P a t t e r n s
Participants can be
labeled as:
Splitters
Aggregators
Forwarders
$10k$11k
2 x $5k $10k
10k CHF
2k CHF
4k CHF
6k CHF
8k CHF
28. E l i c i t a t i o n P ro c e s s
FilterConnectMatch
29. B i t c o i n Tr a n s a c t i o n s :
R u n t i m e v s . M a t c h i n g C o m p l e x i t y
0"
1000"
2000"
3000"
4000"
5000"
6000"
7000"
4" 6" 8" 10" 12"
Time%(sec)%
Matching%Complexity%
Time"in"GC"
Total"Processing"Time"
Dataset Size: 50M Transactions, Matching Duration: 1 week
Runtime: 27 min
Throughput: 1.8M / min
Runtime: 35 min
Throughput: 1.4M / min
Runtime: 98 min
Throughput: 0.5M / min
30. T R I P L E R U S H F R A U D D E T E C T I O N
D C O P S
Signal/Collect as a Platform
Images::BoInsogna,https://鍖ic.kr/p/9famxT,CC-NC-NDhttps://creativecommons.org/licenses/by-nc-nd/2.0/
Antana,https://鍖ic.kr/p/gGQPhA,CC-BY-SA,https://creativecommons.org/licenses/by-sa/2.0/,
JeffKubina,https://鍖ic.kr/p/2CG5PU,CC-BY-SA
32. Vertex Coloring in action
Optimized Version of DSA Running on a MacBook Pro with 8 workers
(slow, due to lots of IO for logging, bookkeeping, etc.)
Scaled to: 10 Million vertices / variables
Verman and Bernstein, 2014
33. Industry Usage
!
We
are
using
Signal/Collect
to
analyze
millions
of
claims
every
day
to
iden9fy
opportuni9es
for
our
clients
to
save
money
through
be=er
healthcare
or
avoiding
fraud,
waste,
and
abuse.
US
Healthcare
Analy/cs
Company
34. "IBM Electronic Data Processing Machine - GPN-2000-001881"
NASA, Public Domain @ Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
35. Processing Graph Streams
"IBM Electronic Data Processing Machine - GPN-2000-001881"
NASA, Public Domain @ Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
48. Semantic Flow
Processing is:
Time-stamped tripes t = <s,p,o> [time]
Semantic 鍖ow F = [t1, t2, ... tn]
Perform query matching on cached subset of F
Subject to Stressincoming data-rate
overwhelms the systems processing
capability
52. Eviction is:
Remove cached data
Note: Eviction may lower recall
Evict potential to produce future results
Types of eviction strategies (considered):
Random
Time-based (i.e, FIFO)
Least Recently Used (LRU)
56. Why not LRU?
LRU
C
Input: EPG
Channel , <Play>, Show 5E
B
Input: EPG
Channel , <Play>, Show 4D
InputInput: EPG
Channel , <Play>, Show 6F
Input: EPG
Channel , <Play>, Show 7G
Input: EPG
Channel , <Play>, Show 8H
D
EPG Cache
57. CLOCK is:
Consider both recency (LRU) and past
results
Giving each data entry a score
The score could be incremented and
depreciated
Named by the buffer management
algorithm CLOCK
72. Limitations
Static depreciation function: dep()
Experiment on other datasets
Real implementation to investigate other
performance metrics
Only local eviction strategies considered
73. CCBY NASA http://www.鍖ickr.com/photos/nasa_jsc_photo/sets/72157629726792248/with/7197236116/0
"IBM Electronic Data Processing Machine - GPN-2000-001881"
NASA, Public Domain @ Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
Semantic Web Reasoning
KB:Asserted Triples
Entailed KB: !
Asserted & Infered Triples
DLReasoning
InductiveR.
AnalogicalR.
YourR.
Signal/Collect
P. Stutz, A. Bernstein, and W. Cohen. Signal/Collect: Graph algorithms for the (semantic) web.
International Semantic Web ConferenceISWC 2010. Springer Berlin Heidelberg, 2010. 764-780.
CLOCK Eviction is:
!
The weight is adjusted by dep():"
Linear: dep(w) = w - 1"
Exponential: dep(w) = w * (0< < 1)
Recency History