際際滷

際際滷Share a Scribd company logo
Thu bernstein key_warp_speed
Processing Linked Data at
Warp Speed
Abraham Bernstein
CCBY NASA http://www.鍖ickr.com/photos/nasa_jsc_photo/sets/72157629726792248/with/7197236116/0
CCBY NASA http://www.鍖ickr.com/photos/nasa_jsc_photo/sets/72157629726792248/with/7197236116/0
"Earth's Location in the Universe (JPEG)" by Andrew Z. Colvin - Own work. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons
http://commons.wikimedia.org/wiki/File:Earth%27s_Location_in_the_Universe_(JPEG).jpg#mediaviewer/File:Earth%27s_Location_in_the_Universe_(JPEG).jpg
Thu bernstein key_warp_speed
Thu bernstein key_warp_speed
"IBM Electronic Data Processing Machine - GPN-2000-001881"
NASA, Public Domain @ Wikimedia Commons - 
http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
Processing Graphs
"IBM Electronic Data Processing Machine - GPN-2000-001881"
NASA, Public Domain @ Wikimedia Commons - 
http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
Semantic Web Reasoning
KB:Asserted Triples
Entailed KB: 	

Asserted & Infered Triples
DLReasoning
InductiveR.
AnalogicalR.
YourR.
Thu bernstein key_warp_speed
Signal/Collect
P. Stutz, A. Bernstein, and W. Cohen. Signal/Collect: Graph algorithms for the (semantic) web.
International Semantic Web ConferenceISWC 2010. Springer Berlin Heidelberg, 2010. 764-780.
 Vertices as stateful
processing units	

 Vertices interact through
signals along edges	

 Which are collected by a
processing function that
updates the vertex state
Processing Graphs Naturally
 De鍖ne the graph structure	

 Vertices represent
RDFS classes	

 Edges from superclasses
to subclasses	

 Vertex state initialized
with the class that the
vertex represents
Signal/Collect: An Intuition
for RDFS subclass inference
id: animal
state: {animal}
id: bird
state: {bird}
id: owl
state: {owl}
id: penguin
state: {penguin}
id: animal
id: bird
id: owlid: penguin
Stutz et al, 2010
Signal/Collect: An Intuition
for RDFS subclass inference
Stutz et al, 2010
id: animal
state: {animal}
id: bird
state: {bird}
id: owl
state: {owl}
id: penguin
state: {penguin}
{bird, animal}
{animal}
{bird, animal}
id: animal
state: {animal}
id: bird
state: {bird}
id: owl
state: {owl}
id: penguin
state: {penguin}
id: penguin
state: {penguin, bird, animal}
id: bird
state: {bird, animal}
id: owl
state: {owl, bird, animal}
def collect =
state [
[
s2signals
s
def signal = state
Scoring/Asychronicity:
Single-Source Shortest Path
state: 
state: 
state: 
state: 
state: 
state: 
state: 0

1


1
1

state: 1 state: 
state: 
state: state: 1
state: 1 2
1
2

21
1

state: 1 state: 2
state: 2
state: 2state: 1
state: 1 2
1
2
3
21
1
3
state: 1 state: 2
state: 2
state: 2state: 1
state: 1
def signal = state + weight
def collect = min (state, min (signals) )
state: 1
Scoring/Asynchronicity:
Single-Source Shortest Path
state: 
state: 
state: 
state: 0
1
1
1
state: 1 state: 
state: 
state: 
state: 1 2
2
2
state: 2
state: 2
state: 2
state: 2
state: 2
var oldState = in鍖nity
def scoreSignal =
if (state ! = oldState)
1
else
0
3
3
PageRank in Code
class Document(id: Any) extends Vertex(id, 0.15) {
def collect = 0.15 + 0.85 * signals[Double].foldLeft(0.0)(_ + _)
}
Algorithm
class Citation(citer: Any, cited: Any) extends Edge(citer, cited) {
def signal = source.state.asInstanceOf[Double] * weight / source.sumOfOutWeights
}
ExecutionInitialization
object Algorithm {
def executeCitationRank(db: SparqlAccessor) {
val computeGraph = new AsynchronousComputeGraph()
val citations = new SparqlTuples(db, "select ?source ?target where {"
+ "?source <http://lsdis.cs.uga.edu/projects/semdis/opus#cites> ?target}")
citations foreach {
case (citer, cited) =>
computeGraph.addVertex[Document](citer)
computeGraph.addVertex[Document](cited)
computeGraph.addEdge[Citation](citer, cited)
}
computeGraph.execute()
}
}
Signal/Collect as a Platform
P. Stutz, A. Bernstein, and W. Cohen. Signal/Collect: Graph algorithms for the (semantic) web.
International Semantic Web ConferenceISWC 2010. Springer Berlin Heidelberg, 2010. 764-780.
T R I P L E R U S H F R A U D D E T E C T I O N
D C O P S
Images::BoInsogna,https://鍖ic.kr/p/9famxT,CC-NC-NDhttps://creativecommons.org/licenses/by-nc-nd/2.0/
Antana,https://鍖ic.kr/p/gGQPhA,CC-BY-SA,https://creativecommons.org/licenses/by-sa/2.0/,
JeffKubina,https://鍖ic.kr/p/2CG5PU,CC-BY-SA
Tr i p l e S t o re
E LV I S D Y L A N
J O B S
? X I N S P I R E D ? Y
? Y I N S P I R E D ? Z
I N S P I R E D
DATA QUERY
IN
SP
IR
E
D
DylanElvis inspired
*Elvis inspiredDylan
* inspired DylanElvis
*
*Elvis
*** inspired Dylan
* *
** *
DylanElvis inspired
*Elvis inspiredDylan
* inspired DylanElvis
*
*Elvis
*** inspired Dylan
* *
** *
*Dylan inspired Jobs
* inspiredJobsDylan
*
JobsDylan inspired
*Dylan
*Jobs
* *
DylanElvis inspired
Dylan
* inspired
** inspired
*Dylan inspired Jobs
* inspired
JobsDylan inspired
Query Vertex
?X inspired ?Y!
?Y inspired ?Z
?X inspired ?Y!
?Y inspired ?Z
Elvis inspired Dylan!
Dylan inspired ?Z
Dylan inspired Jobs!
Jobs inspired ?Z
No vertex with ID!
[ Jobs inspired * ]
Query Vertex
{ ?X = Elvis, ?Y = Dylan, ?Z = Jobs }

Elvis inspired Dylan!
Dylan inspired Jobs

?X inspired ?Y!
?Y inspired ?Z
DylanElvis inspired
Dylan
* inspired
** inspired
*Dylan inspired Jobs
* inspired
JobsDylan inspired
Query Vertex
?X inspired ?Y!
?Y inspired ?Z
?X inspired ?Y!
?Y inspired ?Z
Elvis inspired Dylan!
Dylan inspired ?Z
Dylan inspired Jobs!
Jobs inspired ?Z
No vertex with ID!
[ Jobs inspired * ]
Query Vertex
{ ?X = Elvis, ?Y = Dylan, ?Z = Jobs }

Elvis inspired Dylan!
Dylan inspired Jobs

?X inspired ?Y!
?Y inspired ?Z
P e r f o r m a n c e R e s u l t s
Distributed (8 nodes), LUBM 10240 (~1.36 billion triples)
Single-node, LUBM 160 (~21 million triples)
Fastest L1 L2 L3 L4 L5 L6 L7 Geo.
of 10 runs mean
TripleRush 3,111.2 1,457.9 0.7 3.5 9.5 29.1 1,165.8 62.1
Trinity.RDF 12,648.0 6,018.0 8,735.0 5.0 4.0 9.0 31,214.0 450.0
TriAD 7,631.0 1,663.0 4,290.0 2.1 0.5 69.0 14,895.0 249.0
TriAD-SG 2,146.0 2,025.0 1,647.0 1.3 0.7 1.4 16,863.0 106.0
Fastest L1 L2 L3 L4 L5 L6 L7 Geo
of 10 runs mean
TripleRush 22.6 27.8 0.4 1.0 0.4 0.9 21.2 2.94
Trinity.RDF 281.0 132.0 110.0 5.0 4.0 9.0 630.0 46.0
TriAD 427.0 117.0 210.0 2.0 0.5 19.0 693.0 39.0
TriAD-SG 97.0 140.0 31.0 1.0 0.2 1.8 711.0 14.0
O n g o i n g Wo r k : 
G r a p h P a r t i t i o n i n g
 https://github.com/uzh/triplerush
T R I P L E R U S H F R A U D D E T E C T I O N
D C O P S
Signal/Collect as a Platform
Images::BoInsogna,https://鍖ic.kr/p/9famxT,CC-NC-NDhttps://creativecommons.org/licenses/by-nc-nd/2.0/
Antana,https://鍖ic.kr/p/gGQPhA,CC-BY-SA,https://creativecommons.org/licenses/by-sa/2.0/,
JeffKubina,https://鍖ic.kr/p/2CG5PU,CC-BY-SA
D e c o m p o s i n g F r a u d
P a t t e r n s
 Participants can be
labeled as:
 Splitters
 Aggregators
 Forwarders
$10k$11k
2 x $5k $10k
10k CHF
2k CHF
4k CHF
6k CHF
8k CHF
E l i c i t a t i o n P ro c e s s
FilterConnectMatch
B i t c o i n Tr a n s a c t i o n s :
R u n t i m e v s . M a t c h i n g C o m p l e x i t y
0"
1000"
2000"
3000"
4000"
5000"
6000"
7000"
4" 6" 8" 10" 12"
Time%(sec)%
Matching%Complexity%
Time"in"GC"
Total"Processing"Time"
Dataset Size: 50M Transactions, Matching Duration: 1 week
Runtime: 27 min
Throughput: 1.8M / min
Runtime: 35 min
Throughput: 1.4M / min
Runtime: 98 min
Throughput: 0.5M / min
T R I P L E R U S H F R A U D D E T E C T I O N
D C O P S
Signal/Collect as a Platform
Images::BoInsogna,https://鍖ic.kr/p/9famxT,CC-NC-NDhttps://creativecommons.org/licenses/by-nc-nd/2.0/
Antana,https://鍖ic.kr/p/gGQPhA,CC-BY-SA,https://creativecommons.org/licenses/by-sa/2.0/,
JeffKubina,https://鍖ic.kr/p/2CG5PU,CC-BY-SA



y
Distributed Constraint
Optimization
z
q
x
x  y
y  z
z  x
q  z
x  {0,1,2}
y  {0,1}
z  {1,2}
q  {0,1,2}
Vertex Coloring in action
Optimized Version of DSA Running on a MacBook Pro with 8 workers

(slow, due to lots of IO for logging, bookkeeping, etc.)

 Scaled to: 10 Million vertices / variables
Verman and Bernstein, 2014
Industry Usage
!
We	
 are	
 using	
 Signal/Collect	
 to	
 analyze	
 millions	
 of	
 
claims	
 every	
 day	
 to	
 iden9fy	
 opportuni9es	
 for	
 our	
 
clients	
 to	
 save	
 money	
 through	
 be=er	
 healthcare	
 or	
 
avoiding	
 fraud,	
 waste,	
 and	
 abuse.	
 
US	
 Healthcare	
 Analy/cs	
 Company
"IBM Electronic Data Processing Machine - GPN-2000-001881"
NASA, Public Domain @ Wikimedia Commons - 
http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
Processing Graph Streams
"IBM Electronic Data Processing Machine - GPN-2000-001881"
NASA, Public Domain @ Wikimedia Commons - 
http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
147.1b  97.5b
147.1b  97.5b 
5100
Thu bernstein key_warp_speed
3mViewers	

300 Events/s
250 TV Channels	

25 frames/s
EPG for 7 days	

LOD enhanced
Thu bernstein key_warp_speed
Traditional TripleStore
http://www.mpi.de/
http://www.i鍖.uzh.ch/
http://www.uni-sb.de/courses/
http://www.universities.de/Saarbruecken http://www.i鍖.uzh.ch/
http://www.i鍖.uzh.ch/i_am_a_URI
http://www.i鍖.uzh.ch/i_am_a_URI
http://www.i鍖.uzh.ch/i_am_a_URI
http://www.i鍖.uzh.ch/i_am_a_URI
http://www.i鍖.uzh.ch/i_am_a_URI
http://www.i鍖.uzh.ch/i_am_a_URI
http://www.lubm.org/teaches
http://www.lubm.org/
Traditional TripleStore
http://www.mpi.de/
http://www.i鍖.uzh.ch/
http://www.uni-sb.de/courses/
http://www.universities.de/Saarbruecken http://www.i鍖.uzh.ch/
http://www.i鍖.uzh.ch/i_am_a_URI
http://www.i鍖.uzh.ch/i_am_a_URI
http://www.i鍖.uzh.ch/i_am_a_URI
http://www.i鍖.uzh.ch/i_am_a_URI
http://www.i鍖.uzh.ch/i_am_a_URI
http://www.i鍖.uzh.ch/i_am_a_URI
http://www.lubm.org/teaches
http://www.lubm.org/
Semantic Flow
Processing
Semantic Flow
Processing
t
ViSTA-TV Viewership Data
0
7500
15000
22500
30000
0h 8h 16h 24h 8h 16h 24h 8h 16h 24h
Num of Valid Data Entries
day 1 day 2 day 3
Cache
ViSTA-足TV	
 project:	
 UserLog	
 and	
 EPG	
 streams
Semantic Flow
Processing is:
 Time-stamped tripes t = <s,p,o> [time] 	

 Semantic 鍖ow F = [t1, t2, ... tn]	

 Perform query matching on cached subset of F	

 Subject to Stressincoming data-rate
overwhelms the systems processing
capability
Context:	

Time Window
last 1 min
?
within 1 sec
Load Shedding
last 1 min
Eviction
last 1 min
?
Eviction is:
 Remove cached data	

 Note: Eviction may lower recall	

 Evict potential to produce future results	

 Types of eviction strategies (considered):	

 Random	

 Time-based (i.e, FIFO)	

 Least Recently Used (LRU)
Why not LRU?
LRU
C
B
Input: UserLog
,<watch>,ChannelUser 1 B
Join
EPG Cache
Two-way Join:	

UserLog EPG
Why not LRU?
LRU
C
B
User 1
Input: UserLog
,<watch>,Channel C
Join
EPG Cache
Why not LRU?
LRU
B
C
Input
Input: EPG
Channel , <Play>, Show 4DD
EPG Cache
D
User 3
Input: UserLog
,<watch>,Channel
?
Why not LRU?
LRU
C
Input: EPG
Channel , <Play>, Show 5E
B
Input: EPG
Channel , <Play>, Show 4D
InputInput: EPG
Channel , <Play>, Show 6F
Input: EPG
Channel , <Play>, Show 7G
Input: EPG
Channel , <Play>, Show 8H
D
EPG Cache
CLOCK is:
 Consider both recency (LRU) and past
results	

 Giving each data entry a score	

 The score could be incremented and
depreciated	

 Named by the buffer management
algorithm CLOCK
CLOCK
CLOCK Score
6
1
3C
B
Input: UserLog
,<watch>,ChannelUser 2 C
Join
EPG Cache
CLOCK
CLOCK Score
6
1
4C
B
Input: UserLog
,<watch>,ChannelUser 2 C
Join
EPG Cache
CLOCK
CLOCK Score
6
1
4C
B
Input: EPG
Channel , <Play>, Show 4D
Input
dep()
EPG Cache
CLOCK
CLOCK Score
5
1
4C
B
Input: EPG
Channel , <Play>, Show 4D
Input
dep()
EPG Cache
CLOCK
CLOCK Score
5
0
4C
B
Input: EPG
Channel , <Play>, Show 4D
Input dep()
D
EPG Cache
CLOCK
CLOCK Score
5
init = 1
4C
Input: EPG
Channel , <Play>, Show 4D
Input dep()
D
EPG Cache
CLOCK
CLOCK Score
5
1
4C
Input
dep()
D
Input: EPG
Channel , <Play>, Show 5E EPG Cache
CLOCK
CLOCK Score
5
1
3C
Input
dep()
D
Input: EPG
Channel , <Play>, Show 5E EPG Cache
CLOCK
CLOCK Score
4
1
3C
Input
dep()
D
Input: EPG
Channel , <Play>, Show 5E EPG Cache
CLOCK
CLOCK Score
4
0
3C
Input dep()
D
Input: EPG
Channel , <Play>, Show 5E EPG CacheE
CLOCK
CLOCK Score
4
1
3C
Input
dep()
E
EPG Cache
Input: EPG
Channel , <Play>, Show 6F
Input: EPG
Channel , <Play>, Show 7G
Input: EPG
Channel , <Play>, Show 8H
CLOCK Eviction is:
!
 The weight is adjusted by dep():	

 Linear: dep(w) = w - 1	

 Exponential: dep(w) = w *  (0<  < 1)
Recency History
Experimental Results
TV Viewership Data
Recall
0%
25%
50%
75%
100%
Cache Size
100% 50% 25% 15% 1%
Random FIFO LRU CLOCK
Two-way join query with 1,919,216 input triples
Experimental Results:	

Depreciation Function
Exponential factor  =  in {0.25, 0.5, 0.75, 0.95}
Limitations
 Static depreciation function: dep()	

 Experiment on other datasets	

 Real implementation to investigate other
performance metrics	

 Only local eviction strategies considered
CCBY NASA http://www.鍖ickr.com/photos/nasa_jsc_photo/sets/72157629726792248/with/7197236116/0	

"IBM Electronic Data Processing Machine - GPN-2000-001881"
NASA, Public Domain @ Wikimedia Commons - 
http://commons.wikimedia.org/wiki/File:IBM_Electronic_Data_Processing_Machine_-_GPN-2000-001881.jpg
Semantic Web Reasoning
KB:Asserted Triples
Entailed KB: !
Asserted & Infered Triples
DLReasoning
InductiveR.
AnalogicalR.
YourR.
Signal/Collect
P. Stutz, A. Bernstein, and W. Cohen. Signal/Collect: Graph algorithms for the (semantic) web.
International Semantic Web ConferenceISWC 2010. Springer Berlin Heidelberg, 2010. 764-780.
CLOCK Eviction is:
!
 The weight is adjusted by dep():"
 Linear: dep(w) = w - 1"
 Exponential: dep(w) = w *  (0<  < 1)
Recency History
CCBY NASA http://www.鍖ickr.com/photos/nasa_jsc_photo/sets/72157629726792248/with/7197236116/0	

Philip Stutz, Mihaela Verman, Shen Gao,
Daniel Strebel, Bibek Paudel, Lorenz Fischer,
Thomas Keller, Robin Hafen, Genc Mazlami

More Related Content

Thu bernstein key_warp_speed