Jubatus is an open-source framework for distributed, online machine learning: it is fault tolerant and performs fixed-time computation. Each server process combines a machine learning model with a feature extractor, and servers in a cluster keep mixing their models, which makes the system fast and resilient to machine failure. Clients access Jubatus through a single RPC interface that stays the same whether they talk to one local server or to a proxy in front of a dynamically scaled-out cluster, and client libraries exist for languages such as Ruby, Python, Perl, and Java. Supported algorithms include classification, recommendation, anomaly detection, clustering, and regression.
6. Architecture
• It looks as if a single server is running
– You can use a single local Jubatus server for development
– Use a cluster of multiple Jubatus servers for production
Client
Jubatus RPC
The same RPC!
9. Architecture
• Whenever a server breaks down
– The proxy conceals the failure, so the service continues.
Client
Jubatus RPC
Proxy
10. Architecture
• Multi-language client libraries
– gem, pip, cpan, maven: ready!
– Under the hood they use MessagePack-RPC.
• So you can use OCaml, Haskell, JavaScript, or Go at your own risk (a raw MessagePack-RPC sketch follows below).
Client
Jubatus RPC
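For a language without an official client, plain MessagePack-RPC is enough. A minimal Python sketch, assuming a Jubatus server or proxy on localhost:9199, an instance name "test", and the common get_config RPC exposed by Jubatus services (method availability and signatures depend on your Jubatus version, so treat the details as assumptions):

import msgpackrpc   # pip install msgpack-rpc-python

# Connect the same way the official clients do under the hood.
client = msgpackrpc.Client(msgpackrpc.Address("127.0.0.1", 9199))

# Every Jubatus RPC takes the instance/cluster name as its first argument;
# the proxy uses it to route the call. get_config returns the config JSON.
print(client.call("get_config", "test"))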
12. Classifier
• Task: classification of a Datum
import sys
def fib(a):
    if a == 1 or a == 0:
        return 1
    else:
        return fib(a-1) + fib(a-2)
if __name__ == "__main__":
    print(fib(int(sys.argv[1])))
def fib(a)
  if a == 1 or a == 0
    1
  else
    return fib(a-1) + fib(a-2)
  end
end
if __FILE__ == $0
  puts fib(ARGV[0].to_i)
end
Sample task: classify which programming language a piece of source code is written in
It's          It's
13. Classifier
• Set the configuration in the Jubatus server
Classifier
Feature Extractor
"converter": {
  "string_types": {
    "bigram": {
      "method": "ngram",
      "char_num": "2"
    }
  },
  "string_rules": [
    {
      "key": "*",
      "type": "bigram",
      "sample_weight": "tf",
      "global_weight": "idf"
    }
  ]
}
Feature Extractor
14. Classifier
• Configuration JSON
– It does the "feature vector design"
– a very important step in machine learning
"converter": {
  "string_types": {
    "bigram": {
      "method": "ngram",
      "char_num": "2"
    }
  },
  "string_rules": [
    {
      "key": "*",
      "type": "bigram",
      "sample_weight": "tf",
      "global_weight": "idf"
    }
  ]
}
settings for extracting features from strings
define a function named "bigram"
use the built-in "ngram" function
pass "2" to "ngram" to create "bigram"
for all data
apply "bigram"
weight features with the tf-idf scheme
see Wikipedia: tf-idf
16. Feature Extractor
• What does the bigram extractor do?
bigram
extractor
import sys
def fib(a):
    if a == 1 or a == 0:
        return 1
    else:
        return fib(a-1) + fib(a-2)
if __name__ == "__main__":
    print(fib(int(sys.argv[1])))
key value
im 1
mp 1
po 1
... ...
): 1
... ...
de 1
ef 1
... ...
Feature Vector
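As an illustration of what the configured converter computes (not Jubatus's actual converter code), the character-bigram counting and one common tf-idf weighting can be sketched in plain Python:

import math
from collections import Counter

def bigrams(text):
    """Character 2-grams, e.g. "import" -> im, mp, po, or, rt."""
    return [text[i:i+2] for i in range(len(text) - 1)]

docs = {
    "python": 'import sys\ndef fib(a):\n    return fib(a-1) + fib(a-2)',
    "ruby":   'def fib(a)\n  fib(a-1) + fib(a-2)\nend',
}

tf = {name: Counter(bigrams(src)) for name, src in docs.items()}  # term counts

df = Counter()                       # document frequency of each bigram
for counts in tf.values():
    df.update(counts.keys())

def tfidf(name, gram, n_docs=len(docs)):
    # one common tf-idf variant; Jubatus's exact weighting may differ
    return tf[name][gram] * math.log(n_docs / df[gram])

print(tf["python"].most_common(5))   # raw bigram counts (the "tf" part)
print(tfidf("python", "im"))         # "im" occurs only in the Python snippet
print(tfidf("python", "fi"))         # "fi" occurs in both snippets, so its idf is 0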
17. Classifier
• Training a model from feature vectors
key value
im 1
mp 1
po 1
... ...
): 1
... ...
de 1
ef 1
... ...
Classifier
key value
pu 1
ut 1
... ...
{| ...
|m 1
m| 1
{| 1
en 1
nd 1
key value
@a 1
$_ 1
... ...
my ...
su 1
ub 1
us 1
se 1
... ...
18. Classifier
• Set the configuration in the Jubatus server
Classifier
"method": "AROW",
"parameter": {
  "regularization_weight": 1.0
}
Feature Extractor
bigram
extractor
Classifier algorithms:
• Perceptron
• Passive Aggressive
• Confidence Weighted
• Adaptive Regularization of Weights (AROW)
• Normal Herd
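Putting the two fragments together: on the server side the method, its parameter, and the converter live in a single JSON configuration. A sketch assembled from the snippets in this deck (check the Jubatus documentation for the exact schema of your version):

import json

# Assembled from the fragments above as an illustration.
config = {
    "method": "AROW",
    "parameter": {"regularization_weight": 1.0},
    "converter": {
        "string_types": {
            "bigram": {"method": "ngram", "char_num": "2"}
        },
        "string_rules": [
            {"key": "*", "type": "bigram",
             "sample_weight": "tf", "global_weight": "idf"}
        ],
    },
}
print(json.dumps(config, indent=2))   # the combined JSON given to the Jubatus server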
19. Classifier
• Using the model for the classification task
– Jubatus will find clues for the classification
AROW
key value
si 1
il 1
... ...
{| 1
... ...
It's
20. Classifier
• Using the model for the classification task
– Jubatus will find clues for the classification
AROW
key value
re 1
): 1
... ...
s[ 1
... ...
It's
21. Via RPC
• Call feature extraction and classification from the client via RPC
AROW
bigram
extractor
lang = client.classify([sourcecode])
(a fuller client sketch follows this slide)
import sys
def fib(a):
    if a == 1 or a == 0:
        return 1
    else:
        return fib(a-1) + fib(a-2)
if __name__ == "__main__":
    print(fib(int(sys.argv[1])))
key value
im 1
mp 1
po 1
... ...
): 1
... ...
de 1
ef 1
... ...
It may be Python
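A minimal sketch of that call with the official Python client, assuming jubatus-python is installed and a jubaclassifier (or proxy) configured with the JSON above is running on localhost:9199 under the name "test"; constructor details and training types differ slightly between client versions, so treat this as a sketch rather than the exact API:

# Assumptions: jubatus-python installed; server/proxy on localhost:9199;
# instance name "test". Check your version's API reference for exact types.
from jubatus.classifier.client import Classifier
from jubatus.common import Datum

client = Classifier("127.0.0.1", 9199, "test")   # host, port, cluster name

source = 'import sys\ndef fib(a):\n    return 1 if a < 2 else fib(a-1) + fib(a-2)'
d = Datum({"code": source})   # any key works: the converter rule uses "key": "*"

# (Training is done the same way via client.train() with labeled data;
#  the exact labeled-datum type differs between client versions.)

results = client.classify([d])                 # one list of estimates per datum
best = max(results[0], key=lambda r: r.score)  # each estimate carries label and score
print(best.label)                              # expected: the language label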
22. What the classifier can do
• You can
– estimate the topic of tweets
– trash spam mail automatically
– monitor server failures from syslog
– estimate the sentiment of users from blog posts
– detect malicious attacks
– find which feature is the best clue for classification
23. What the classifier cannot do
• You cannot
– train a model from data without supervised answers
– create a class without knowledge of that class
– get a good model without correct feature design
24. How to use?
• See the examples in
http://github.com/jubatus/jubatus-example
– gender
– shogun
– malware classification
– language detection
25. Recommender
• Task: which datum is similar to this datum?
Name | Star Wars | Harry Potter | Star Trek | Titanic | Frozen
John: 4 3 2 2
Bob: 5 3
Erika: 1 3 4 5
Jack: 2 5
Ann: 4 5
Emily: 1 4 2 5 4
Which movie should we recommend to Ann?
26. Recommender
• Recommendation based on nearest neighbors
[Diagram: users mapped in the high-dimensional movie-rating space — Science Fiction / Star Trek lovers: John, Jack; Love Romance / Fantasy: Erika, Ann; Star Wars lovers: Bob, Emily; nearby users have similar tastes, distant users do not.]
27. Recommender
• Ann and Emily are near
– so we should recommend Frozen to Ann
Name | Star Wars | Harry Potter | Star Trek | Titanic | Frozen
Ann: 4 5 ★
Emily: 1 4 2 5 4
I bet Ann would like it!
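The same intuition in a few lines of user-based nearest-neighbor Python. The ratings below are made up for illustration (the table's blank cells are not recoverable from this transcript), and this is plain cosine similarity rather than Jubatus's recommender engine:

from math import sqrt

# user -> {movie: rating}; numbers are illustrative, not the slide's exact cells
ratings = {
    "Ann":   {"Harry Potter": 4, "Titanic": 5},
    "Emily": {"Star Wars": 1, "Harry Potter": 4, "Star Trek": 2,
              "Titanic": 5, "Frozen": 4},
    "John":  {"Star Wars": 4, "Star Trek": 5, "Titanic": 2},
}

def cosine(u, v):
    """Cosine similarity, using only the movies both users rated in the dot product."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    return dot / (sqrt(sum(x * x for x in u.values())) *
                  sqrt(sum(x * x for x in v.values())))

target = "Ann"
neighbor = max((u for u in ratings if u != target),
               key=lambda u: cosine(ratings[target], ratings[u]))

# recommend the neighbor's best-rated movie that Ann has not rated yet
unseen = {m: r for m, r in ratings[neighbor].items() if m not in ratings[target]}
print(neighbor, max(unseen, key=unseen.get))   # -> Emily Frozen (with this toy data)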
28. Recommender with Feature Extractor
• The recommender server consists of a Feature Extractor and a Recommender engine.
– Jubatus calculates the distance between feature vectors
Recommender
Feature Extractor
The Recommender engine can use
• MinHash
• Locality Sensitive Hashing
• Euclidean Locality Sensitive Hashing
for defining distance.
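These hashing methods replace exact distance computations with short signatures. A toy sketch of the random-hyperplane flavor of Locality Sensitive Hashing (the idea only, not Jubatus's implementation): each vector becomes a bit signature, and the Hamming distance between signatures approximates the angle between the original vectors.

import numpy as np

rng = np.random.default_rng(0)
DIM, BITS = 100, 32
planes = rng.normal(size=(BITS, DIM))   # random hyperplanes

def signature(v):
    """32-bit signature: one bit per hyperplane (which side v falls on)."""
    return (planes @ v) >= 0

def hamming(a, b):
    return int(np.count_nonzero(a != b))

a = rng.normal(size=DIM)
b = a + 0.1 * rng.normal(size=DIM)      # nearly parallel to a
c = rng.normal(size=DIM)                # unrelated vector

print(hamming(signature(a), signature(b)))  # small: a and b are "near"
print(hamming(signature(a), signature(c)))  # around BITS/2: "far"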
29. Recommender with Feature Extractor
• Jubatus maps data into the feature space
– There are distances between the data points
• How near or far are they?
key value
pu 1
ut 1
... ...
{| ...
|m 1
m| 1
{| 1
Feature
Extractor
key value
im 1
mp 1
... ...
... ...
"{ 1
fo 1
... ...
key value
Ma 1
ap 1
... ...
in 1
nt 1
te 1
er 1
Recommender
Ruby
Python
Java
30. What the Recommender can do
• You can
– create a recommendation engine for e-commerce
– calculate the similarity of tweets
– find NBA players with a similar style
– visualize the distance between "Star Wars" and "Star Trek"
31. What the Recommender cannot do
• You cannot
– label data (use the classifier!)
– get a decision tree
– get a-priori based recommendations
34. Anomaly Detection
• Distance-based detection is not good enough
– We cannot decide on an appropriate distance threshold
The distance is equal!
35. Anomaly Detection with Feature Extractor
• The anomaly detection server consists of a Feature Extractor and an anomaly detection engine.
– Jubatus finds outliers among the feature vectors
Anomaly
Detection
Feature
Extractor
The Anomaly Detection engine can use
• MinHash
• Locality Sensitive Hashing
• Euclidean Locality Sensitive Hashing
for defining distance.
36. Anomaly Detection
• jubaanomaly can do it!
– It is based on the local outlier factor (LOF) algorithm (a small illustration follows this slide)
key value
pu 1
ut 1
... ...
{| ...
|m 1
m| 1
{| 1
Feature
Extractor
key value
im 1
mp 1
... ...
... ...
"{ 1
fo 1
... ...
key value
Ma 1
ap 1
... ...
in 1
nt 1
te 1
er 1
Anomaly
Detection
Outlier!
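To see what the local outlier factor buys over a plain distance threshold, here is a small illustration using scikit-learn's LocalOutlierFactor (for the concept only; jubaanomaly has its own LOF engine behind the RPC API). The point sitting just outside the tight cluster is flagged even though its distance to its neighbors is no larger than the spacing inside the loose cluster.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# a tight cluster, a loose cluster, and one point sitting just outside the
# tight cluster at a distance that would be perfectly normal in the loose one
tight = np.random.default_rng(0).normal(loc=0.0, scale=0.05, size=(20, 2))
loose = np.random.default_rng(1).normal(loc=5.0, scale=1.0, size=(20, 2))
suspect = np.array([[0.8, 0.8]])

X = np.vstack([tight, loose, suspect])
lof = LocalOutlierFactor(n_neighbors=5)
labels = lof.fit_predict(X)              # -1 marks outliers

print(labels[-1])                        # -1: the suspect point is flagged
print(lof.negative_outlier_factor_[-1])  # much more negative than typical points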
37. What Anomaly Detection can do
• You (might) be able to
– find outliers
– grasp the trend and overview of the current data stream
– detect or predict server failures
– protect web services from zero-day attacks
38. What Anomaly Detection cannot do
• You cannot
– know the cluster distribution of the data
– find every kind of outlier with 100% accuracy
– easily understand how each outlier occurred
– know why a datum was assigned a high outlier score
39. Conclusion
• Jubatus ships an embedded feature extractor alongside its algorithms.
• Users should configure both the feature extractor and the algorithm properly.
• Clients use the configured machine learning via Jubatus RPC.
• The Classifier, Recommender, and Anomaly Detection engines may be useful for your task.
#2: Hello, I'll speak about Jubatus.
You may have heard about Jubatus, but I'm afraid you don't know it well.
In this talk, I hope you'll see what Jubatus can do and how to use it for your task.
#3: Jubatus has 3 features.
Jubatus is a distributed online machine-learning framework.
Distributed means resilient to machine failure.
Jubatus can also increase its performance for your task by coordinating a multi-machine cluster.
Online means fixed-time computation.
The Jubatus developers carefully designed the API so that users can balance performance against computation time.
Machine learning is a key factor of the Big Data age.
You'll need more than "word count".
#4: This is an overview of a Jubatus process.
The red rectangle is one Jubatus process.
Inside the process there are two components:
a Feature Extractor and a Machine Learning Model.
You connect your program to Jubatus via Jubatus RPC,
so you can do machine learning with a client-server model.
#5: You can combine these processes into a cluster.
Jubatus servers in a cluster communicate with each other, making machine learning faster and more reliable.
The whole model is shared and is resilient to machine failure.
#6: Even when many Jubatus servers are running and continuously mixing their models,
the user can communicate with the cluster via the Jubatus proxy as if it were a single Jubatus server.
#7: The communication protocol between a Jubatus server and a client is exactly the same as between a Jubatus proxy and a client.
This is useful for developers: they can run Jubatus on a local machine as a development environment and deploy the same client code against production clusters.
#8: A big benefit of a distributed system is that Jubatus can scale performance out.
In your production environment, if the RPC load is too heavy for the cluster's throughput,
#9: you can add machines to the cluster and its performance will increase.
This suits the cloud computing era.
#10: A Jubatus cluster is also resilient to machine failure.
Whenever a server breaks down, the proxy conceals the failure so the service continues.
So you can add or remove cluster machines dynamically.
#11: Jubatus client libraries are implemented in many languages.
You can get them via gem, pip, cpan, and maven.
If you want to use another language, you can use a MessagePack-RPC client at your own risk.
It will work! (I tried JavaScript.)
#12: Jubatus has many kinds of machine-learning modules.
You can start using them quickly.
Among the 6 modules, the Classifier, Recommender, and Anomaly Detection will be a great help to you.
I'll introduce these 3 modules.
#13: The classifier classifies data.
As a sample task, you may want to detect the programming language of a piece of source code.
In this case, you classify the language from the text.
#14: First of all, you have to set the configuration on the Jubatus server.
The configuration is written in JSON.
#15: In this case, you choose the built-in ngram function and pass it the number 2, which gives you a bigram function.
Then you set a rule: every inserted datum will be handled with this bigram function.
The weights of the resulting features are regulated with the tf-idf scheme.
#16: Now the Feature Extractor becomes a "bigram extractor".
#17: With this bigram extractor, every datum is split into two-character tokens.
"import" becomes "im", "mp", "po", "or", "rt" under the bigram scheme.
This representation of a datum is the Feature Vector.
The bigram extractor extracts bigrams from a datum to obtain a Feature Vector.
#18: You extract feature vectors from source code in many languages.
The Jubatus Classifier learns from the feature vectors and creates a model.
#19: Next, the classifier algorithm should be configured.
You can select the classifier algorithm from Perceptron, Passive Aggressive, or the others.
#20: The trained model classifies a datum from its feature vector.
In this case, the Jubatus classifier finds a characteristically Ruby feature like "{|",
scores Ruby highly, and estimates that this source code is Ruby.
#21: For another datum, Jubatus finds a characteristically Python feature like "):".
Jubatus scores this feature highly and estimates that the source code should be Python.
#22: You can perform this procedure via Jubatus RPC.
Over RPC, you send a datum for classification and Jubatus returns the classification result.
All you have to do is write a precise JSON configuration and the client source code.
#23: You can
estimate the topic of a tweet,
trash spam mail automatically,
monitor server failures from syslog,
estimate sentiment from blog posts,
detect attacks on the network,
and calculate which feature is the best clue for classification.
#24: You cannot
train a model from data without supervised answers,
create a class without knowledge of that class,
or get a good model without correct feature design.
#25: More information on using the classifier is available in the official Jubatus example repository.
These 4 samples may be useful for study.
#26: The next Jubatus algorithm is the recommender.
Given this movie rating matrix, which movie should we recommend to Ann?
Jubatus can answer.
#27: Imagine the high-dimensional rating space.
Star Wars lovers and Star Trek lovers are relatively close;
both kinds of movie are science fiction.
Ann and Emily are relatively close.
These distances are useful for recommendation,
because people's preferences tend to be similar.
#29: The Jubatus recommender server consists of a Feature Extractor and a recommender engine.
The feature extractor is exactly the same as the classifier's.
Jubatus calculates the distance between feature vectors.
#30: Continuing the earlier example, the Jubatus recommender extracts feature vectors from source code, and the recommender engine maps each vector into the feature space.
#31: You can
create a recommendation engine,
calculate the similarity of tweets,
find NBA players with a similar style,
and visualize the distance between "Star Wars" and "Star Trek".
Notice that you can use the recommender for more than recommendation.
#32: The recommender is based on an unsupervised algorithm,
so you cannot
label data (use the classifier!)
or get a decision tree.
And because it is nearest-neighbor based,
you cannot get a-priori based recommendations.
#33: Another algorithm is Anomaly Detection.
It calculates "how far is this datum from the others?"
#34: Jubatus can detect outliers in a mass of data.
#35: As a naive approach, you might use the recommender's distance score to find outliers.
But distance is not homogeneous across the data, so it cannot be used directly to discover outliers.
#36: The anomaly detection server consists of a Feature Extractor and an anomaly detection engine.
The feature extractor is exactly the same as the classifier's and recommender's.
Jubatus finds outliers among the feature vectors.
#37: As with the recommender, Jubatus detects anomalies from Feature Vectors.
You access this procedure via RPC, too.
#38: You (might) be able to
find outliers,
detect or predict server failures,
protect services against zero-day attacks,
and grasp the trend of the entire data stream.
#39: You cannot
get the most common datum,
get a cluster map of the data,
or automatically diagnose why a datum is an outlier.
#40: Jubatus ships an embedded feature extractor alongside its algorithms.
Users should configure both the feature extractor and the algorithm properly.
Clients use the configured machine learning via Jubatus RPC.
The Classifier, Recommender, and Anomaly Detection may be useful for your task.