The Krylov Project is the key component of eBay's AI Platform initiative. It provides an easy-to-use, open, and fast AI orchestration engine that is deployed as managed services in the eBay cloud.
Using Krylov, AI scientists can access eBay's massive datasets; build and train AI models; spin up powerful compute (high-memory or GPU instances) on the Krylov compute cluster; and set up machine learning pipelines using declarative constructs that stitch together the pipeline lifecycle.
S8277 - Introducing Krylov: AI Platform that Empowers eBay Data Science and Engineering Teams
1. Introducing Krylov
eBay AI Platform - Machine Learning Made Easy
GPU Technology Conference, 2018
Henry Saputra
Technical Lead for Krylov - eBay Unified AI Platform
2. 1. Data Science and Machine Learning at eBay
2. Introducing Krylov
3. Compute Cluster and Accelerator Support with Nvidia GPU
4. Quickstart Example
5. Future Roadmap
6. Q & A
Agenda
4. eBay Patterns - Tools and Frameworks
Tools
- Languages: R, Python, Scala, C++
- IDE-like: RStudio, Notebooks (Jupyter), Python IDE
- Frameworks: NumPy, SciPy, matplotlib, scikit-learn, Spark MLlib, H2O, Weka, XGBoost, Moses
- Pipelines: Cron, Luigi, Apache Airflow, Apache Oozie
Patterns for ML Training
- Single node
- Distributed training
- Deep learning (GPUs)
Key takeaway = CHOICE
1. Flexibility of software
2. Flexibility of hardware configuration
5. 1. 50%-70% is plumbing work
a. Accessing and moving secured data
b. Environment and tools setup
c. Sub-optimal compute instances - NVIDIA GPU and high-memory/CPU instances
d. Long wait times for platform and infrastructure
2. Loss of productivity and opportunities
a. ML lifecycle management of models and features
b. Building robust model training pipelines: data preparation, algorithm, hyperparameter tuning, cross-validation
3. Collaboration is almost impossible
4. Research vs Applied ML
Problems and Challenges
7. Krylov is the core project of the eBay unified AI Platform initiative, which aims to provide an easy-to-use and powerful cloud-based data science and machine learning platform.
The objective of the project is to give machine learning jobs easy access to secured data and eBay cloud computing resources.
The main goals for the Krylov initiative are:
- Easy and secure access to training datasets
- Access to compute on high-performance machines, such as GPUs or clusters of machines
- Familiar tools and flexible software to run machine learning model training jobs
- Interactive data analysis and visualization, with multi-tenancy support to allow quick prototyping of algorithms and data access
- Sharing and collaboration on ML work between teams at eBay
Overview
8. ML Lifecycle Management
Data + Lifecycle Management
- Model inferencing: deployable, scalable
- Model building: interactive, iterative
- Model re-fitting: interactive, iterative
- Model re-training: interactive, iterative
- Model training: automatable, repeatable, scalable
10. eBay AI Platform Components
Infrastructure - Krylov
AI Engine - Krylov
Learning Pipelines
Model Experimentation
Data Scientist Workspaces
Model Lifecycle Management
GPU Tall instances
Fast Storage
Data: Preparation, Movement, Discovery, Access
AI Hub (Shared Repository)
AI Modules: Speech Recognition, Machine Translation, Computer Vision, Information Retrieval, Natural Language Understanding
Inferencing
12. 1. Client Command Line Interface (CLI) via the krylovctl program
2. ML Application and Run Specification
3. ML Pipelines: Workflow and Workspace
4. Namespaces - For quota and data isolation
5. Jobs and Runs - Managed by Krylov Tools and Minions
6. Secure Data Access - HDFS, NFS, OpenStack Swift, Custom
Krylov Main Features and Concepts
14. A Krylov ML Application is a versioned unit of deployment that contains the declaration of the developer's programs
Implemented as a client project that serves as the source for building a deployment artifact
Three main parts:
- mlapplication.json and artifact.json configuration files
- Source code of the programs
- Dependency management via a Dockerfile
Supported types of programs: JVM languages (Java, Scala), Python, shell scripts
Using the ML Application as source, developers can build a deployment artifact that the Run Specification file can use to deploy it onto one of the nodes in the cluster
Krylov ML Application
16. The Krylov Run Specification is a runtime configuration that adds configuration overrides and parameter passing for each Task in ML Application job submissions
It tells the Krylov master API server which artifact created by the ML Application will be used in the compute cluster
Defined as a runspec.json file, or passed as an argument to the krylovctl client program
The runspec.json file also defines the compute resources, such as which NVIDIA GPUs to use, CPU, and memory, and which Docker image provides the dependencies used by the ML Application programs
Krylov Run Specification
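The override behaviour described for the Run Specification can be sketched in Python as a recursive merge of per-Task overrides onto defaults. This is only an illustration of the idea: the slides do not show the runspec.json schema, so every field name here (`artifact`, `tasks`, `parameters`, `resources`, `gpu`, `image`) and every value is a hypothetical placeholder, not Krylov's actual format.

```python
import copy
import json

# Hypothetical defaults, standing in for what an ML Application might declare.
APP_DEFAULTS = {
    "tasks": {
        "train": {
            "parameters": {"epochs": 10, "learning_rate": 0.01},
            "resources": {"cpu": 4, "memory_gb": 16, "gpu": 0},
        }
    }
}

# Hypothetical run specification text; the real runspec.json layout is unknown.
RUNSPEC_JSON = """
{
  "artifact": "example-artifact",
  "tasks": {
    "train": {
      "parameters": {"epochs": 50},
      "resources": {"gpu": 1},
      "image": "example/image:latest"
    }
  }
}
"""


def deep_merge(base, override):
    """Return a copy of `base` with `override` applied recursively.

    Nested dictionaries are merged key by key; any other value in
    `override` replaces the value in `base`.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_run(defaults, runspec_text):
    """Parse the run specification and layer it over the defaults."""
    return deep_merge(defaults, json.loads(runspec_text))


effective = resolve_run(APP_DEFAULTS, RUNSPEC_JSON)
print(json.dumps(effective["tasks"]["train"], indent=2, sort_keys=True))
```

In this sketch the run specification changes only `epochs`, `gpu`, and `image` for the `train` Task; everything else falls through from the defaults, and the defaults themselves are left unmodified.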
18. The Krylov ML batch lifecycle pipeline is defined as a Krylov Workflow definition
Declarative
Default Generic Workflow
Important concepts for Krylov Workflow:
Workflow - A single pipeline defined within Krylov and the unit of deployment for an ML Application
Each Workflow contains one or more Tasks
The Tasks are connected to each other in a Directed Acyclic Graph (DAG) structure
Task - The smallest unit of execution; it runs a developer's Program and executes on a single machine
Flows - Contains one or more key-value pairs, each mapping a name to a declaration of a Task DAG
Flow - The key chosen to run from the possible selections in the Flows definition
Krylov ML Pipelines: Workflow
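The relationship between Flows, Flow, and Tasks can be illustrated with a small Python sketch that uses the standard library's `graphlib` (Python 3.9+) to order Tasks by their dependencies. It is not Krylov's workflow engine: the `FLOWS` layout, the Task names, and the toy Programs are all hypothetical, and everything runs in one process rather than on cluster machines.

```python
from graphlib import TopologicalSorter

# Hypothetical Flows definition: each key names a flow, and each value
# maps a Task to the list of Tasks it depends on (its DAG predecessors).
FLOWS = {
    "full": {
        "prepare_data": [],
        "train": ["prepare_data"],
        "evaluate": ["train"],
    },
    "train_only": {
        "train": [],
    },
}


# Toy Programs standing in for a developer's code; each Task runs one.
def prepare_data(context):
    context["rows"] = [1.0, 2.0, 3.0]


def train(context):
    rows = context.get("rows", [0.0])
    context["mean"] = sum(rows) / len(rows)


def evaluate(context):
    context["ok"] = context["mean"] > 0


PROGRAMS = {"prepare_data": prepare_data, "train": train, "evaluate": evaluate}


def run_flow(flows, flow_name, programs):
    """Run every Task of the chosen Flow in dependency order.

    TopologicalSorter raises graphlib.CycleError if the declaration
    is not acyclic, which enforces the DAG requirement.
    """
    dag = flows[flow_name]
    order = list(TopologicalSorter(dag).static_order())
    context = {}
    for task in order:
        programs[task](context)
    return order, context


order, context = run_flow(FLOWS, "full", PROGRAMS)
print("executed:", order)
print("context:", context)
```

Selecting `"train_only"` instead of `"full"` runs a different DAG from the same definition, which is the role the slide assigns to the chosen Flow key.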
21. A Workspace is an interactive web application that lets developers use a web browser for ML model prototyping, data preparation, and exploration
The Workspace runs as Jupyter Notebook servers launched on high-CPU/memory or NVIDIA GPU instances
Krylov enhances the JupyterHub project to allow distributed launching of multi-tenant Jupyter Notebook servers in the Krylov compute cluster using Kubernetes
A Krylov Workspace uses a configuration file at creation time to override and customize default parameters
Krylov ML Pipelines: Workspace
30. 1. Download the krylovctl program from the Krylov release repository
2. Run `krylovctl project create` to create a new project on the local machine
3. Update or add code to the Krylov project for the machine learning programs
4. Register each one as a Program within a Task in mlapplication.json
5. Add a new Flow for the defined Tasks to construct the Workflow as a Directed Acyclic Graph (DAG)
6. Run `krylovctl project build` to build the project
7. Run `krylovctl artifact create` to copy the runnables of the program into an artifact file
8. Run `krylovctl artifact upload` to upload the artifact file for remote execution
9. Run `krylovctl job run` for local execution, or `krylovctl job submit` to run it in the compute cluster
Steps to Submit Krylov Workflow Job with CLI
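Steps 6 to 9 above lend themselves to a small wrapper script. The Python sketch below strings together only the subcommands named on the slide; it assumes they take no further flags or arguments, which the real krylovctl may well require, and it defaults to a dry run that just prints the commands. The one-time `krylovctl project create` step is left out on purpose.

```python
import shutil
import subprocess

# Subcommands exactly as named on the slide. No extra flags or arguments
# are assumed here; the real CLI may need some.
BUILD_STEPS = [
    ["krylovctl", "project", "build"],
    ["krylovctl", "artifact", "create"],
    ["krylovctl", "artifact", "upload"],
]


def plan(remote):
    """Return the ordered commands for a local run or a cluster submission."""
    final_step = ["krylovctl", "job", "submit" if remote else "run"]
    return BUILD_STEPS + [final_step]


def run_steps(steps, dry_run=True):
    """Print each command and, unless dry_run is set, execute it.

    Execution stops at the first failing command because check=True
    makes subprocess.run raise CalledProcessError.
    """
    if not dry_run and shutil.which("krylovctl") is None:
        raise FileNotFoundError("krylovctl was not found on PATH")
    shown = []
    for command in steps:
        line = "$ " + " ".join(command)
        print(line)
        shown.append(line)
        if not dry_run:
            subprocess.run(command, check=True)
    return shown


# Dry run of a cluster submission: prints the four commands in order.
printed = run_steps(plan(remote=True), dry_run=True)
```

Calling `run_steps(plan(remote=False), dry_run=False)` would actually invoke the commands for a local run, provided krylovctl is installed and accepts them in this bare form.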
33. 1. Inferencing Platform
2. Exploration and documentation of RESTful APIs for job management
3. Data Source and Dataset abstraction via Krylov SDKs
4. Managed ML Pipelines - Computer Vision, NLP, Machine Translation
5. Distributed Deep Learning
6. AutoML - Hyperparameter Tuning
7. AI Hub to share ML Applications and Datasets
Future Roadmap