Slideshows by User: matei / http://www.slideshare.net/images/logo.gif / Thu, 12 Nov 2020 23:31:27 GMT / SlideShare feed for slideshows by user matei

Scaling Databricks to Run Data and ML Workloads on Millions of VMs /slideshow/scaling-databricks-to-run-data-and-ml-workloads-on-millions-of-vms/239227656
Keynote at Scale By The Bay 2020. Cloud service developers need to handle massive-scale workloads from thousands of customers with no downtime or regressions. In this talk, I’ll present our experience building a very large-scale cloud service at Databricks, which provides a data and ML platform used by many of the largest enterprises in the world. Databricks manages millions of cloud VMs that process exabytes of data per day for interactive, streaming, and batch production applications. This means that our control plane has to handle a wide range of workload patterns and cloud issues such as outages. I’ll describe how we built the Databricks control plane using Scala services and open source infrastructure such as Kubernetes, Envoy, and Prometheus, along with the design patterns and engineering processes we learned along the way. In addition, I’ll describe how we have adapted the data analytics systems themselves to improve reliability and manageability in the cloud, such as creating an ACID storage layer that is as reliable as the underlying cloud object store (Delta Lake) and adding autoscaling and auto-shutdown features to Apache Spark.

Thu, 12 Nov 2020 23:31:27 GMT / matei@slideshare.net (matei)
Scaling Databricks to Run Data and ML Workloads on Millions of VMs from Matei Zaharia

Making Data Timelier and More Reliable with Lakehouse Technology /slideshow/making-data-timelier-and-more-reliable-with-lakehouse-technology/238444684
Enterprise data architectures usually contain many systems—data lakes, message queues, and data warehouses—that data must pass through before it can be analyzed. Each transfer step between systems adds a delay and a potential source of errors. What if we could remove all these steps? In recent years, cloud storage and new open source systems have enabled a radically new architecture: the lakehouse, an ACID transactional layer over cloud storage that can provide streaming, management features, indexing, and high-performance access similar to a data warehouse. Thousands of organizations, including the largest Internet companies, are now using lakehouses to replace separate data lake, warehouse, and streaming systems and to deliver high-quality data faster internally. I’ll discuss the key trends and recent advances in this area based on Delta Lake, the most widely used open source lakehouse platform, which was developed at Databricks.

Thu, 10 Sep 2020 17:07:22 GMT / matei@slideshare.net (matei)
Making Data Timelier and More Reliable with Lakehouse Technology from Matei Zaharia

Scaling up Machine Learning Development /slideshow/scaling-up-machine-learning-development/229346777
An update on the open source machine learning platform MLflow, given by Matei Zaharia at ScaledML 2020, with details on the new autologging and model registry features and on large-scale use cases.

Fri, 28 Feb 2020 01:01:09 GMT / matei@slideshare.net (matei)
Scaling up Machine Learning Development from Matei Zaharia

MLflow: A Platform for Production Machine Learning /slideshow/mlflow-a-platform-for-production-machine-learning/205874652
A presentation about MLflow and the ML platforms / MLOps class of software systems, given at the NeurIPS 2019 ML Systems workshop.

Sat, 14 Dec 2019 23:40:08 GMT / matei@slideshare.net (matei)
MLflow: A Platform for Production Machine Learning from Matei Zaharia

Lessons from Large-Scale Cloud Software at Databricks /slideshow/lessons-from-largescale-cloud-software-at-databricks/196184276
Keynote by Matei Zaharia at SOCC 2019.

Abstract: The cloud has become one of the most attractive ways for enterprises to purchase software, but it requires building products in a very different way from traditional software, which has not been heavily studied in research. I will explain some of these challenges based on my experience at Databricks, a startup that provides a data analytics platform as a service on AWS and Azure. Databricks manages millions of VMs per day to run data engineering and machine learning workloads using Apache Spark, TensorFlow, Python, and other software for thousands of customers. Two main challenges arise in this context: (1) building a reliable, scalable control plane that can manage thousands of customers at once, and (2) adapting the data processing software itself (e.g. Apache Spark) to an elastic cloud environment (for instance, autoscaling instead of assuming static clusters). These challenges are especially significant for data analytics workloads, whose users constantly push boundaries in terms of scale (e.g. number of VMs used, data size, metadata size, number of concurrent users, etc.). I’ll describe some of the common challenges that our new services face and some of the main ways that Databricks has extended and modified open source analytics software for the cloud environment (e.g., designing an autoscaling engine for Apache Spark and creating a transactional storage layer on top of S3 in the Delta Lake open source project).

Bio: Matei Zaharia is an Assistant Professor of Computer Science at Stanford University and Chief Technologist at Databricks. He started the Apache Spark project during his PhD at UC Berkeley in 2009, and has worked broadly on datacenter systems, co-starting the Apache Mesos project and contributing as a committer on Apache Hadoop. Today, Matei tech-leads the MLflow open source machine learning platform at Databricks and is a PI in the DAWN Lab at Stanford, focusing on systems for ML. Matei’s research was recognized with the 2014 ACM Doctoral Dissertation Award for the best PhD dissertation in computer science, an NSF CAREER Award, and the US Presidential Early Career Award for Scientists and Engineers (PECASE).

Fri, 22 Nov 2019 00:06:17 GMT / matei@slideshare.net (matei)
Lessons from Large-Scale Cloud Software at Databricks from Matei Zaharia

What are the Unique Challenges and Opportunities in Systems for ML? /slideshow/what-are-the-unique-challenges-and-opportunities-in-systems-for-ml/196180458
Presentation by Matei Zaharia at the SOSP 2019 AI Systems workshop on the systems research challenges specific to machine learning systems, including debugging and performance optimization for ML. Covers research from Stanford DAWN and an industry perspective from Databricks.

Thu, 21 Nov 2019 23:45:28 GMT / matei@slideshare.net (matei)
What are the Unique Challenges and Opportunities in Systems for ML? from Matei Zaharia

I am an Assistant Professor of Computer Science at Stanford and cofounder and Chief Technologist at Databricks, working on systems for large-scale data processing, machine learning, and cloud computing, including Apache Spark, Delta Lake, and MLflow. cs.stanford.edu/~matei/