Jubatus is an open-source framework for distributed, online machine learning: it is fault tolerant and performs fixed-time computation. Each server process combines a machine learning model with a feature extractor, and servers in a cluster keep mixing their models, which makes the system fast and resilient to machine failure. Clients access Jubatus through a single RPC interface that stays the same whether they talk to one local server or to a proxy in front of a dynamically scaled-out cluster, and client libraries exist for languages such as Ruby, Python, Perl, and Java. Supported algorithms include classification, recommendation, anomaly detection, clustering, and regression.
6. Architecture
• It looks as if a single server is running
– You can use a single local Jubatus server for development
– Use a cluster of multiple Jubatus servers for production
Client
Jubatus RPC
The same RPC!
9. Architecture
• Whenever a server breaks down
– The proxy conceals the failure, so the service continues.
Client
Jubatus RPC
Proxy
10. Architecture
• Multi-language client libraries
– gem, pip, cpan, maven: ready!
– Under the hood they use MessagePack-RPC.
• So you can use OCaml, Haskell, JavaScript, or Go at your own risk (a raw MessagePack-RPC sketch follows below).
Client
Jubatus RPC
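For a language without an official client, plain MessagePack-RPC is enough. A minimal Python sketch, assuming a Jubatus server or proxy on localhost:9199, an instance name "test", and the common get_config RPC exposed by Jubatus services (method availability and signatures depend on your Jubatus version, so treat the details as assumptions):

import msgpackrpc   # pip install msgpack-rpc-python

# Connect the same way the official clients do under the hood.
client = msgpackrpc.Client(msgpackrpc.Address("127.0.0.1", 9199))

# Every Jubatus RPC takes the instance/cluster name as its first argument;
# the proxy uses it to route the call. get_config returns the config JSON.
print(client.call("get_config", "test"))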
12. Classifier
• Task: classification of a Datum
import sys
def fib(a):
    if a == 1 or a == 0:
        return 1
    else:
        return fib(a-1) + fib(a-2)
if __name__ == "__main__":
    print(fib(int(sys.argv[1])))
def fib(a)
  if a == 1 or a == 0
    1
  else
    return fib(a-1) + fib(a-2)
  end
end
if __FILE__ == $0
  puts fib(ARGV[0].to_i)
end
Sample task: classify which programming language a piece of source code is written in
It's          It's
13. Classifier
• Set the configuration in the Jubatus server
Classifier
Feature Extractor
"converter": {
  "string_types": {
    "bigram": {
      "method": "ngram",
      "char_num": "2"
    }
  },
  "string_rules": [
    {
      "key": "*",
      "type": "bigram",
      "sample_weight": "tf",
      "global_weight": "idf"
    }
  ]
}
Feature Extractor
14. Classifier
• Configuration JSON
– It does the "feature vector design"
– a very important step in machine learning
"converter": {
  "string_types": {
    "bigram": {
      "method": "ngram",
      "char_num": "2"
    }
  },
  "string_rules": [
    {
      "key": "*",
      "type": "bigram",
      "sample_weight": "tf",
      "global_weight": "idf"
    }
  ]
}
settings for extracting features from strings
define a function named "bigram"
use the built-in "ngram" function
pass "2" to "ngram" to create "bigram"
for all data
apply "bigram"
weight features with the tf-idf scheme
see Wikipedia: tf-idf
16. Feature Extractor
• What does the bigram extractor do?
bigram
extractor
import sys
def fib(a):
    if a == 1 or a == 0:
        return 1
    else:
        return fib(a-1) + fib(a-2)
if __name__ == "__main__":
    print(fib(int(sys.argv[1])))
key value
im 1
mp 1
po 1
... ...
): 1
... ...
de 1
ef 1
... ...
Feature Vector
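As an illustration of what the configured converter computes (not Jubatus's actual converter code), the character-bigram counting and one common tf-idf weighting can be sketched in plain Python:

import math
from collections import Counter

def bigrams(text):
    """Character 2-grams, e.g. "import" -> im, mp, po, or, rt."""
    return [text[i:i+2] for i in range(len(text) - 1)]

docs = {
    "python": 'import sys\ndef fib(a):\n    return fib(a-1) + fib(a-2)',
    "ruby":   'def fib(a)\n  fib(a-1) + fib(a-2)\nend',
}

tf = {name: Counter(bigrams(src)) for name, src in docs.items()}  # term counts

df = Counter()                       # document frequency of each bigram
for counts in tf.values():
    df.update(counts.keys())

def tfidf(name, gram, n_docs=len(docs)):
    # one common tf-idf variant; Jubatus's exact weighting may differ
    return tf[name][gram] * math.log(n_docs / df[gram])

print(tf["python"].most_common(5))   # raw bigram counts (the "tf" part)
print(tfidf("python", "im"))         # "im" occurs only in the Python snippet
print(tfidf("python", "fi"))         # "fi" occurs in both snippets, so its idf is 0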
17. Classifier
• Training a model from feature vectors
key value
im 1
mp 1
po 1
... ...
): 1
... ...
de 1
ef 1
... ...
Classifier
key value
pu 1
ut 1
... ...
{| ...
|m 1
m| 1
{| 1
en 1
nd 1
key value
@a 1
$_ 1
... ...
my ...
su 1
ub 1
us 1
se 1
... ...
18. Classifier
• Set the configuration in the Jubatus server
Classifier
"method": "AROW",
"parameter": {
  "regularization_weight": 1.0
}
Feature Extractor
bigram
extractor
Classifier algorithms:
• Perceptron
• Passive Aggressive
• Confidence Weighted
• Adaptive Regularization of Weights (AROW)
• Normal Herd
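Putting the two fragments together: on the server side the method, its parameter, and the converter live in a single JSON configuration. A sketch assembled from the snippets in this deck (check the Jubatus documentation for the exact schema of your version):

import json

# Assembled from the fragments above as an illustration.
config = {
    "method": "AROW",
    "parameter": {"regularization_weight": 1.0},
    "converter": {
        "string_types": {
            "bigram": {"method": "ngram", "char_num": "2"}
        },
        "string_rules": [
            {"key": "*", "type": "bigram",
             "sample_weight": "tf", "global_weight": "idf"}
        ],
    },
}
print(json.dumps(config, indent=2))   # the combined JSON given to the Jubatus server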
19. Classifier
• Using the model for the classification task
– Jubatus will find clues for the classification
AROW
key value
si 1
il 1
... ...
{| 1
... ...
It's
20. Classifier
• Using the model for the classification task
– Jubatus will find clues for the classification
AROW
key value
re 1
): 1
... ...
s[ 1
... ...
It's
21. Via RPC
• Call feature extraction and classification from the client via RPC
AROW
bigram
extractor
lang = client.classify([sourcecode])
(a fuller client sketch follows this slide)
import sys
def fib(a):
    if a == 1 or a == 0:
        return 1
    else:
        return fib(a-1) + fib(a-2)
if __name__ == "__main__":
    print(fib(int(sys.argv[1])))
key value
im 1
mp 1
po 1
... ...
): 1
... ...
de 1
ef 1
... ...
It may be Python
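A minimal sketch of that call with the official Python client, assuming jubatus-python is installed and a jubaclassifier (or proxy) configured with the JSON above is running on localhost:9199 under the name "test"; constructor details and training types differ slightly between client versions, so treat this as a sketch rather than the exact API:

# Assumptions: jubatus-python installed; server/proxy on localhost:9199;
# instance name "test". Check your version's API reference for exact types.
from jubatus.classifier.client import Classifier
from jubatus.common import Datum

client = Classifier("127.0.0.1", 9199, "test")   # host, port, cluster name

source = 'import sys\ndef fib(a):\n    return 1 if a < 2 else fib(a-1) + fib(a-2)'
d = Datum({"code": source})   # any key works: the converter rule uses "key": "*"

# (Training is done the same way via client.train() with labeled data;
#  the exact labeled-datum type differs between client versions.)

results = client.classify([d])                 # one list of estimates per datum
best = max(results[0], key=lambda r: r.score)  # each estimate carries label and score
print(best.label)                              # expected: the language label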
22. What the classifier can do
• You can
– estimate the topic of tweets
– trash spam mail automatically
– monitor server failures from syslog
– estimate the sentiment of users from blog posts
– detect malicious attacks
– find which feature is the best clue for classification
23. What the classifier cannot do
• You cannot
– train a model from data without supervised answers
– create a class without knowledge of that class
– get a good model without correct feature design
24. How to use?
• See the examples in
http://github.com/jubatus/jubatus-example
– gender
– shogun
– malware classification
– language detection
25. Recommender
• Task: which datum is similar to this datum?
Name | Star Wars | Harry Potter | Star Trek | Titanic | Frozen
John: 4 3 2 2
Bob: 5 3
Erika: 1 3 4 5
Jack: 2 5
Ann: 4 5
Emily: 1 4 2 5 4
Which movie should we recommend to Ann?
26. Recommender
• Recommendation based on nearest neighbors
[Diagram: users mapped in the high-dimensional movie-rating space — Science Fiction / Star Trek lovers: John, Jack; Love Romance / Fantasy: Erika, Ann; Star Wars lovers: Bob, Emily; nearby users have similar tastes, distant users do not.]
27. Recommender
• Ann and Emily are near
– so we should recommend Frozen to Ann
Name | Star Wars | Harry Potter | Star Trek | Titanic | Frozen
Ann: 4 5 ★
Emily: 1 4 2 5 4
I bet Ann would like it!
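The same intuition in a few lines of user-based nearest-neighbor Python. The ratings below are made up for illustration (the table's blank cells are not recoverable from this transcript), and this is plain cosine similarity rather than Jubatus's recommender engine:

from math import sqrt

# user -> {movie: rating}; numbers are illustrative, not the slide's exact cells
ratings = {
    "Ann":   {"Harry Potter": 4, "Titanic": 5},
    "Emily": {"Star Wars": 1, "Harry Potter": 4, "Star Trek": 2,
              "Titanic": 5, "Frozen": 4},
    "John":  {"Star Wars": 4, "Star Trek": 5, "Titanic": 2},
}

def cosine(u, v):
    """Cosine similarity, using only the movies both users rated in the dot product."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    return dot / (sqrt(sum(x * x for x in u.values())) *
                  sqrt(sum(x * x for x in v.values())))

target = "Ann"
neighbor = max((u for u in ratings if u != target),
               key=lambda u: cosine(ratings[target], ratings[u]))

# recommend the neighbor's best-rated movie that Ann has not rated yet
unseen = {m: r for m, r in ratings[neighbor].items() if m not in ratings[target]}
print(neighbor, max(unseen, key=unseen.get))   # -> Emily Frozen (with this toy data)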
28. Recommender with Feature Extractor
• The recommender server consists of a Feature Extractor and a Recommender engine.
– Jubatus calculates the distance between feature vectors
Recommender
Feature Extractor
The Recommender engine can use
• MinHash
• Locality Sensitive Hashing
• Euclidean Locality Sensitive Hashing
for defining distance.
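These hashing methods replace exact distance computations with short signatures. A toy sketch of the random-hyperplane flavor of Locality Sensitive Hashing (the idea only, not Jubatus's implementation): each vector becomes a bit signature, and the Hamming distance between signatures approximates the angle between the original vectors.

import numpy as np

rng = np.random.default_rng(0)
DIM, BITS = 100, 32
planes = rng.normal(size=(BITS, DIM))   # random hyperplanes

def signature(v):
    """32-bit signature: one bit per hyperplane (which side v falls on)."""
    return (planes @ v) >= 0

def hamming(a, b):
    return int(np.count_nonzero(a != b))

a = rng.normal(size=DIM)
b = a + 0.1 * rng.normal(size=DIM)      # nearly parallel to a
c = rng.normal(size=DIM)                # unrelated vector

print(hamming(signature(a), signature(b)))  # small: a and b are "near"
print(hamming(signature(a), signature(c)))  # around BITS/2: "far"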
29. Recommender with Feature Extractor
• Jubatus maps data into the feature space
– There are distances between the data points
• How near or far are they?
key value
pu 1
ut 1
... ...
{| ...
|m 1
m| 1
{| 1
Feature
Extractor
key value
im 1
mp 1
... ...
... ...
"{ 1
fo 1
... ...
key value
Ma 1
ap 1
... ...
in 1
nt 1
te 1
er 1
Recommender
Ruby
Python
Java
30. What the Recommender can do
• You can
– create a recommendation engine for e-commerce
– calculate the similarity of tweets
– find NBA players with a similar style
– visualize the distance between "Star Wars" and "Star Trek"
31. What the Recommender cannot do
• You cannot
– label data (use the classifier!)
– get a decision tree
– get a-priori based recommendations
34. Anomaly Detection
• Distance-based detection is not good enough
– We cannot decide on an appropriate distance threshold
The distance is equal!
35. Anomaly Detection with Feature Extractor
• The anomaly detection server consists of a Feature Extractor and an anomaly detection engine.
– Jubatus finds outliers among the feature vectors
Anomaly
Detection
Feature
Extractor
The Anomaly Detection engine can use
• MinHash
• Locality Sensitive Hashing
• Euclidean Locality Sensitive Hashing
for defining distance.
36. Anomaly Detection
• jubaanomaly can do it!
– It is based on the local outlier factor (LOF) algorithm (a small illustration follows this slide)
key value
pu 1
ut 1
... ...
{| ...
|m 1
m| 1
{| 1
Feature
Extractor
key value
im 1
mp 1
... ...
... ...
"{ 1
fo 1
... ...
key value
Ma 1
ap 1
... ...
in 1
nt 1
te 1
er 1
Anomaly
Detection
Outlier!
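To see what the local outlier factor buys over a plain distance threshold, here is a small illustration using scikit-learn's LocalOutlierFactor (for the concept only; jubaanomaly has its own LOF engine behind the RPC API). The point sitting just outside the tight cluster is flagged even though its distance to its neighbors is no larger than the spacing inside the loose cluster.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# a tight cluster, a loose cluster, and one point sitting just outside the
# tight cluster at a distance that would be perfectly normal in the loose one
tight = np.random.default_rng(0).normal(loc=0.0, scale=0.05, size=(20, 2))
loose = np.random.default_rng(1).normal(loc=5.0, scale=1.0, size=(20, 2))
suspect = np.array([[0.8, 0.8]])

X = np.vstack([tight, loose, suspect])
lof = LocalOutlierFactor(n_neighbors=5)
labels = lof.fit_predict(X)              # -1 marks outliers

print(labels[-1])                        # -1: the suspect point is flagged
print(lof.negative_outlier_factor_[-1])  # much more negative than typical points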
37. What Anomaly Detection can do
• You (might) be able to
– find outliers
– grasp the trend and overview of the current data stream
– detect or predict server failures
– protect web services from zero-day attacks
38. What Anomaly Detection cannot do
• You cannot
– know the cluster distribution of the data
– find every kind of outlier with 100% accuracy
– easily understand how each outlier occurred
– know why a datum was assigned a high outlier score
39. Conclusion
• Jubatus ships an embedded feature extractor alongside its algorithms.
• Users should configure both the feature extractor and the algorithm properly.
• Clients use the configured machine learning via Jubatus RPC.
• The Classifier, Recommender, and Anomaly Detection engines may be useful for your task.
#2: Hello, I'll speak about Jubatus.
You may have heard about Jubatus, but I'm afraid you don't know it well.
In this talk, I hope you'll see what Jubatus can do and how to use it for your task.
#3: Jubatus has 3 features.
Jubatus is a distributed online machine-learning framework.
Distributed means resilient to machine failure.
Jubatus can also increase its performance for your task by coordinating a multi-machine cluster.
Online means fixed-time computation.
The Jubatus developers carefully designed the API so that users can balance performance against computation time.
Machine learning is a key factor of the Big Data age.
You'll need more than "word count".
#4: This is an overview of a Jubatus process.
The red rectangle is one Jubatus process.
Inside the process there are two components:
a Feature Extractor and a Machine Learning Model.
You connect your program to Jubatus via Jubatus RPC,
so you can do machine learning with a client-server model.
#5: You can combine these processes into a cluster.
Jubatus servers in a cluster communicate with each other, making machine learning faster and more reliable.
The whole model is shared and is resilient to machine failure.
#6: Even when many Jubatus servers are running and continuously mixing their models,
the user can communicate with the cluster via the Jubatus proxy as if it were a single Jubatus server.
#7: The communication protocol between a Jubatus server and a client is exactly the same as between a Jubatus proxy and a client.
This is useful for developers: they can run Jubatus on a local machine as a development environment and deploy the same client code against production clusters.
#8: A big benefit of a distributed system is that Jubatus can scale performance out.
In your production environment, if the RPC load is too heavy for the cluster's throughput,
#9: you can add machines to the cluster and its performance will increase.
This suits the cloud computing era.
#10: A Jubatus cluster is also resilient to machine failure.
Whenever a server breaks down, the proxy conceals the failure so the service continues.
So you can add or remove cluster machines dynamically.
#11: Jubatus client libraries are implemented in many languages.
You can get them via gem, pip, cpan, and maven.
If you want to use another language, you can use a MessagePack-RPC client at your own risk.
It will work! (I tried JavaScript.)
#12: Jubatus has many kinds of machine-learning modules.
You can start using them quickly.
Among the 6 modules, the Classifier, Recommender, and Anomaly Detection will be a great help to you.
I'll introduce these 3 modules.
#13: The classifier classifies data.
As a sample task, you may want to detect the programming language of a piece of source code.
In this case, you classify the language from the text.
#14: First of all, you have to set the configuration on the Jubatus server.
The configuration is written in JSON.
#15: In this case, you choose the built-in ngram function and pass it the number 2, which gives you a bigram function.
Then you set a rule: every inserted datum will be handled with this bigram function.
The weights of the resulting features are regulated with the tf-idf scheme.
#16: Now the Feature Extractor becomes a "bigram extractor".
#17: With this bigram extractor, every datum is split into two-character tokens.
"import" becomes "im", "mp", "po", "or", "rt" under the bigram scheme.
This representation of a datum is the Feature Vector.
The bigram extractor extracts bigrams from a datum to obtain a Feature Vector.
#18: You extract feature vectors from source code in many languages.
The Jubatus Classifier learns from the feature vectors and creates a model.
#19: Next, the classifier algorithm should be configured.
You can select the classifier algorithm from Perceptron, Passive Aggressive, or the others.
#20: The trained model classifies a datum from its feature vector.
In this case, the Jubatus classifier finds a characteristically Ruby feature like "{|",
scores Ruby highly, and estimates that this source code is Ruby.
#21: For another datum, Jubatus finds a characteristically Python feature like "):".
Jubatus scores this feature highly and estimates that the source code should be Python.
#22: You can perform this procedure via Jubatus RPC.
Over RPC, you send a datum for classification and Jubatus returns the classification result.
All you have to do is write a precise JSON configuration and the client source code.
#23: You can
estimate the topic of a tweet,
trash spam mail automatically,
monitor server failures from syslog,
estimate sentiment from blog posts,
detect attacks on the network,
and calculate which feature is the best clue for classification.
#24: You cannot
train a model from data without supervised answers,
create a class without knowledge of that class,
or get a good model without correct feature design.
#25: More information on using the classifier is available in the official Jubatus example repository.
These 4 samples may be useful for study.
#26: The next Jubatus algorithm is the recommender.
Given this movie rating matrix, which movie should we recommend to Ann?
Jubatus can answer.
#27: Imagine the high-dimensional rating space.
Star Wars lovers and Star Trek lovers are relatively close;
both kinds of movie are science fiction.
Ann and Emily are relatively close.
These distances are useful for recommendation,
because people's preferences tend to be similar.
#29: The Jubatus recommender server consists of a Feature Extractor and a recommender engine.
The feature extractor is exactly the same as the classifier's.
Jubatus calculates the distance between feature vectors.
#30: Continuing the earlier example, the Jubatus recommender extracts feature vectors from source code, and the recommender engine maps each vector into the feature space.
#31: You can
create a recommendation engine,
calculate the similarity of tweets,
find NBA players with a similar style,
and visualize the distance between "Star Wars" and "Star Trek".
Notice that you can use the recommender for more than recommendation.
#32: The recommender is based on an unsupervised algorithm,
so you cannot
label data (use the classifier!)
or get a decision tree.
And because it is nearest-neighbor based,
you cannot get a-priori based recommendations.
#33: Another algorithm is Anomaly Detection.
It calculates "how far is this datum from the others?"
#34: Jubatus can detect outliers in a mass of data.
#35: As a naive approach, you might use the recommender's distance score to find outliers.
But distance is not homogeneous across the data, so it cannot be used directly to discover outliers.
#36: The anomaly detection server consists of a Feature Extractor and an anomaly detection engine.
The feature extractor is exactly the same as the classifier's and recommender's.
Jubatus finds outliers among the feature vectors.
#37: As with the recommender, Jubatus detects anomalies from Feature Vectors.
You access this procedure via RPC, too.
#38: You (might) be able to
find outliers,
detect or predict server failures,
protect services against zero-day attacks,
and grasp the trend of the entire data stream.
#39: You cannot
get the most common datum,
get a cluster map of the data,
or automatically diagnose why a datum is an outlier.
#40: Jubatus ships an embedded feature extractor alongside its algorithms.
Users should configure both the feature extractor and the algorithm properly.
Clients use the configured machine learning via Jubatus RPC.
The Classifier, Recommender, and Anomaly Detection may be useful for your task.