Slideshows by User: dbtsai / Thu, 06 Apr 2017 17:59:18 GMT / SlideShare feed for slideshows by user dbtsai

2017 Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East talk by DB Tsai
/slideshow/2017-netflixs-recommendation-ml-pipeline-using-apache-spark-spark-summit-east-talk-by-db-tsai/74570250
Netflix is the world's largest streaming service, with 80 million members in over 190 countries. Netflix uses machine learning to inform nearly every aspect of the product, from the recommendations you get, to the box art you see, to the decisions made about which TV shows and movies are created. Given this scale, we use Apache Spark as the engine of our recommendation pipeline. Apache Spark enables Netflix to use a single, unified framework and API for ETL, feature generation, model training, and validation. With the Pipeline framework in Spark ML, each step within the Netflix recommendation pipeline (e.g. label generation, feature encoding, model training, model evaluation) is encapsulated as Transformers, Estimators, and Evaluators, enabling modularity, composability, and testability. Thus, Netflix engineers can build our own feature engineering logic as Transformers, learning algorithms as Estimators, and customized metrics as Evaluators; with these building blocks, we can more easily experiment with new pipelines and rapidly deploy them to production. In this talk, we will discuss how Apache Spark is used as the distributed framework on which we build our own algorithms to generate personalized recommendations for each of our 80+ million subscribers, specific techniques we use at Netflix to scale, and the various pitfalls we've found along the way.
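To make the Transformer/Estimator/Evaluator vocabulary concrete, here is a minimal sketch of a Spark ML Pipeline in Scala. It is not Netflix's actual pipeline: the input path, column names, and stages are hypothetical, chosen only to show how feature encoding, model training, and evaluation compose into one reusable, testable pipeline.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

    // Hypothetical training data with a "label" column plus raw feature columns.
    val training = spark.read.parquet("/path/to/training")  // placeholder path

    // Feature encoding stages (Transformers produced by Estimators).
    val indexer = new StringIndexer().setInputCol("country").setOutputCol("countryIdx")
    val assembler = new VectorAssembler()
      .setInputCols(Array("countryIdx", "daysSinceSignup"))
      .setOutputCol("features")

    // Model training stage (an Estimator).
    val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

    // Each stage is an interchangeable building block in the Pipeline.
    val pipeline = new Pipeline().setStages(Array(indexer, assembler, lr))
    val model = pipeline.fit(training)

    // Model evaluation stage (an Evaluator).
    val evaluator = new BinaryClassificationEvaluator().setLabelCol("label")
    val auc = evaluator.evaluate(model.transform(training))
    println(s"Training AUC: $auc")

    spark.stop()
  }
}
```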

2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at Spark Summit 2015
/slideshow/2015-06-largescale-lasso-and-elasticnet-regularized-generalized-linear-models-at-spark-summit/49477658
Nonlinear methods are widely used to produce higher performance than linear methods; however, nonlinear methods are generally more expensive in model size, training time, and scoring. With proper feature engineering techniques such as polynomial expansion, linear methods can be as competitive as nonlinear ones. In the process of mapping the data to a higher-dimensional space, linear methods become subject to overfitting and instability of coefficients, which can be addressed by penalization methods including Lasso and Elastic-Net. Finally, we'll show how to train linear models with Elastic-Net regularization using MLlib.

Several learning algorithms, such as kernel methods, decision trees, and random forests, are nonlinear approaches that are widely used for better performance than linear methods. However, with feature engineering techniques like polynomial expansion, which maps the data into a higher-dimensional space, the performance of linear methods can be as competitive as that of nonlinear methods. As a result, linear methods remain very useful, given that their training time is significantly faster than that of nonlinear ones and the model is just a small vector, which makes the prediction step very efficient and easy. However, by mapping the data into a higher-dimensional space, linear methods become subject to overfitting and instability of coefficients; these issues can be successfully addressed by penalization methods including Lasso and Elastic-Net. The Lasso method, with an L1 penalty, tends to shrink many coefficients exactly to zero and leave a few others with comparatively little shrinkage. The L2 penalty tends to result in all small but non-zero coefficients. Combining the L1 and L2 penalties gives the Elastic-Net method, which tends to produce a result in between. In the first part of the talk, we'll give an overview of linear methods, including commonly used formulations and optimization techniques such as L-BFGS and OWL-QN. In the second part of the talk, we will discuss how to train linear models with Elastic-Net using our recent contribution to Spark MLlib. We'll also talk about how linear models are applied in practice to big datasets, and how polynomial expansion can be used to dramatically increase performance.

DB Tsai is an Apache Spark committer and a Senior Research Engineer at Netflix. He has recently been working with the Apache Spark community to add several new algorithms, including Linear Regression and Binary Logistic Regression with Elastic-Net (L1/L2) regularization, Multinomial Logistic Regression, and the L-BFGS optimizer. Prior to joining Netflix, DB was a Lead Machine Learning Engineer at Alpine Data Labs, where he developed innovative large-scale distributed linear algorithms and contributed them back to the open source Apache Spark project.
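A minimal sketch of the two ideas in this abstract, polynomial expansion of the features followed by an Elastic-Net-regularized linear model in spark.ml. The data path and column names are placeholders, and the regularization values are illustrative rather than tuned.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.PolynomialExpansion
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.sql.SparkSession

object ElasticNetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("elastic-net-sketch").getOrCreate()

    // Placeholder dataset with columns "label" (Double) and "features" (Vector).
    val data = spark.read.parquet("/path/to/labeled-data")

    // Map the features into a higher-dimensional space so the linear model
    // can capture some nonlinearity.
    val poly = new PolynomialExpansion()
      .setInputCol("features")
      .setOutputCol("polyFeatures")
      .setDegree(2)

    // Elastic-Net: regParam is the overall penalty strength (lambda);
    // elasticNetParam mixes L1 and L2 (0.0 = ridge, 1.0 = lasso).
    val lr = new LinearRegression()
      .setFeaturesCol("polyFeatures")
      .setLabelCol("label")
      .setRegParam(0.1)
      .setElasticNetParam(0.5)
      .setMaxIter(100)

    val model = new Pipeline().setStages(Array(poly, lr)).fit(data)
    val fitted = model.stages.last.asInstanceOf[LinearRegressionModel]
    println(s"Nonzero coefficients: ${fitted.coefficients.numNonzeros}")

    spark.stop()
  }
}
```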

2015-01-17 Lambda Architecture with Apache Spark, NextML Conference
/slideshow/2015-0117-lambda-architecture/43621736
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. A Lambda architecture involves three layers, each with its own set of requirements: batch processing, speed (or real-time) processing, and a serving layer for responding to queries.

The batch layer aims at perfect accuracy by processing the entire available dataset, an immutable, append-only set of raw data, with a distributed processing system. Output is typically stored in a read-only database, with results completely replacing the existing precomputed views. Apache Hadoop, Pig, and Hive are the de facto batch-processing systems.

In the speed layer, the data is processed in a streaming fashion, and real-time views are produced from the most recent data. As a result, the speed layer is responsible for filling the "gap" caused by the batch layer's lag in providing views based on the most recent data. This layer's views may not be as accurate as those the batch layer computes over the full dataset, so they are eventually replaced by the batch layer's views. Traditionally, Apache Storm is used in this layer.

The serving layer stores the results from the batch and speed layers and responds to queries in a low-latency, ad-hoc way.

One example of the lambda architecture in a machine learning context is a fraud detection system. In the speed layer, incoming streaming data can be used for online learning to update the model learned in the batch layer so that it incorporates recent events. After a while, the model is rebuilt using the full dataset.

Why Spark for the lambda architecture? Traditionally, different technologies are used in the batch layer and the speed layer. If your batch system is implemented with Apache Pig and your speed layer with Apache Storm, you have to write and maintain the same logic in SQL and in Java/Scala, which very quickly becomes a maintenance nightmare. With Spark, we have a unified development framework for the batch and speed layers at scale. In this talk, an end-to-end example implemented in Spark will be shown, and we will discuss the development, testing, maintenance, and deployment of a lambda architecture system with Apache Spark.
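The sketch below illustrates the point about sharing one codebase across layers: a single aggregation function reused by a batch job and by a Spark Streaming job. The input paths, the socket source, and the line format are assumptions for the example; a production speed layer would read from something like Kafka and write to a serving store rather than printing.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LambdaSketch {
  // Shared business logic: count events per user. Both layers call this,
  // so there is only one implementation to test and maintain.
  def countByUser(events: RDD[String]): RDD[(String, Long)] =
    events.map(line => (line.split(",")(0), 1L)).reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lambda-sketch"))

    // Batch layer: recompute views from the full, append-only master dataset.
    val batchViews = countByUser(sc.textFile("hdfs:///events/master/*"))
    batchViews.saveAsTextFile("hdfs:///views/batch")

    // Speed layer: incrementally compute views from the most recent data.
    val ssc = new StreamingContext(sc, Seconds(10))
    val stream = ssc.socketTextStream("localhost", 9999)  // placeholder source
    stream.foreachRDDdummy// replaced below
    ssc.start()
    ssc.awaitTermination()
  }
}
```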

2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Things Conference
/slideshow/2014-1020-largescale-machine-learning-with-apache-spark/40514831
Apache Spark is a new cluster computing engine offering a number of advantages over its predecessor, MapReduce. Apache Spark utilizes an in-memory cache to scale and parallelize iterative algorithms, which makes it ideal for large-scale machine learning. It is one of the most active open source projects in big data, surpassing even Hadoop MapReduce. In this talk, DB will introduce Spark and show how to use Spark's high-level API in Java, Scala, or Python. Then, he will show how to use MLlib, a library of machine learning algorithms for big data included in Spark, to do classification, regression, clustering, and recommendation at large scale.
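As one concrete example of the MLlib algorithms mentioned above (classification, regression, clustering, recommendation), here is a small collaborative-filtering sketch with ALS from the RDD-based MLlib API. The ratings file path, its comma-separated format, and the hyperparameter values are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object ALSSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("als-sketch"))

    // Placeholder input: one "userId,productId,rating" record per line.
    val ratings = sc.textFile("hdfs:///ratings.csv").map { line =>
      val Array(user, product, rating) = line.split(",")
      Rating(user.toInt, product.toInt, rating.toDouble)
    }.cache()  // iterative algorithm, so keep the data in memory

    // Train a matrix factorization model: rank 10, 10 iterations, lambda 0.01.
    val model = ALS.train(ratings, 10, 10, 0.01)

    // Recommend 5 products for user 42 (illustrative id).
    model.recommendProducts(42, 5).foreach(println)

    sc.stop()
  }
}
```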

2014-08-14 Alpine Innovation to Spark
/slideshow/20140814-alpine-invovation-to-spark/38015901
Spark is rapidly catching fire with the machine learning and data science community for a number of reasons. Most importantly, it makes it possible to extend and enhance machine learning algorithms to a level we've never seen before. In this talk, we'll give examples of two areas in which Alpine Data Labs has contributed to the Spark project.

Bio: DB Tsai is a Machine Learning Engineer working at Alpine Data Labs. His current focus is on Big Data, Data Mining, and Machine Learning. He uses Hadoop, Spark, Mahout, and several machine learning algorithms to build powerful, scalable, and robust cloud-driven applications. His favorite programming languages are Java, Scala, and Python. DB is a Ph.D. candidate in Applied Physics at Stanford University (currently on a leave of absence). He holds a Master's degree in Electrical Engineering from Stanford University, as well as a Master's degree in Physics from National Taiwan University.

2014-06-20 Multinomial Logistic Regression with Apache Spark
/slideshow/2014-0620-mlor-36132297/36132297
Logistic regression can be used for modeling not only binary outcomes but also, with some extension, multinomial outcomes. In this talk, DB will explain the basic idea of binary logistic regression step by step and then extend it to the multinomial case. He will show how easy it is with Spark to parallelize this iterative algorithm by utilizing the in-memory RDD cache to scale horizontally (in the number of training samples). However, there is a mathematical limitation on scaling vertically (in the number of training features), while many recent applications from document classification and computational linguistics are of this type. He will talk about how to address this problem with the L-BFGS optimizer instead of the Newton optimizer.

Bio: DB Tsai is a machine learning engineer working at Alpine Data Labs. He has recently been working with the Spark MLlib team to add support for the L-BFGS optimizer and multinomial logistic regression upstream. He also led the Apache Spark development at Alpine Data Labs. Before joining Alpine Data Labs, he worked on large-scale optimization of optical quantum circuits at Stanford as a PhD student.
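A minimal sketch of how this support is exposed in MLlib's RDD-based API: multinomial logistic regression trained with the L-BFGS optimizer. The LIBSVM data path and the number of classes are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

object MLORSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mlor-sketch"))

    // Placeholder dataset in LIBSVM format; cached because training is iterative.
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///sample_multiclass_data.txt").cache()
    val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

    // L-BFGS-based trainer; setNumClasses > 2 gives the multinomial model.
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(3)
      .run(training)

    // Simple holdout accuracy, computed directly on the RDD.
    val predictionAndLabels = test.map(p => (model.predict(p.features), p.label))
    val accuracy = predictionAndLabels.filter { case (p, l) => p == l }.count().toDouble / test.count()
    println(s"Test accuracy: $accuracy")

    sc.stop()
  }
}
```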

Large-Scale Machine Learning with Apache Spark (Fri, 09 May 2014)
/dbtsai/m-llib-sf-machine-learning
Spark is a new cluster computing engine that is rapidly gaining popularity: with over 150 contributors in the past year, it is one of the most active open source projects in big data, surpassing even Hadoop MapReduce. Spark was designed both to make traditional MapReduce programming easier and to support new types of applications, with machine learning among the earliest focus areas. In this talk, we'll introduce Spark and show how to use it to build fast, end-to-end machine learning workflows. Using Spark's high-level API, we can process raw data with familiar libraries in Java, Scala, or Python (e.g. NumPy) to extract the features for machine learning. Then, using MLlib, its built-in machine learning library, we can run scalable versions of popular algorithms. We'll also cover upcoming development work, including new built-in algorithms and R bindings.

Bio: Xiangrui Meng is a software engineer at Databricks. He has been actively involved in the development of Spark MLlib since he joined. Before Databricks, he worked as an applied research engineer at LinkedIn, where he was the main developer of an offline machine learning framework in Hadoop MapReduce. His thesis work at Stanford was on randomized algorithms for large-scale linear regression.
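To illustrate the kind of end-to-end workflow described above, the sketch below parses raw delimited text into MLlib LabeledPoints and trains a linear classifier. The file path, delimiter, column layout, and choice of a linear SVM are assumptions made only for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object WorkflowSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("workflow-sketch"))

    // Feature extraction: assume "label,f1,f2,...,fn" per line.
    val points = sc.textFile("hdfs:///raw/training.csv").map { line =>
      val fields = line.split(",").map(_.toDouble)
      LabeledPoint(fields.head, Vectors.dense(fields.tail))
    }.cache()

    // Train a scalable linear classifier from MLlib.
    val model = SVMWithSGD.train(points, numIterations = 100)

    // Score the training set (for illustration only).
    val trainingError = points
      .map(p => if (model.predict(p.features) == p.label) 0.0 else 1.0)
      .mean()
    println(s"Training error: $trainingError")

    sc.stop()
  }
}
```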

Unsupervised Learning with Apache Spark (Mon, 05 May 2014)
/slideshow/unsupervised-learning-with-apache-spark/34280608
Unsupervised learning refers to a branch of algorithms that try to find structure in unlabeled data. Clustering algorithms, for example, try to partition the elements of a dataset into related groups, while dimensionality reduction algorithms search for a simpler representation of a dataset. Spark's MLlib module contains implementations of several unsupervised learning algorithms that scale to huge datasets. In this talk, we'll dive into the uses and implementations of Spark's K-means clustering and singular value decomposition (SVD).

Bio: Sandy Ryza is an engineer on the data science team at Cloudera. He is a committer on Apache Hadoop and recently led Cloudera's Apache Spark development.
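A short sketch, under assumed file paths and formats, of the two MLlib algorithms named in this abstract: K-means clustering on an RDD of vectors and SVD via a distributed RowMatrix. The values of k are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object UnsupervisedSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("unsupervised-sketch"))

    // Placeholder input: whitespace-separated numeric features, one row per line.
    val vectors = sc.textFile("hdfs:///features.txt")
      .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
      .cache()

    // K-means: partition the rows into k related groups.
    val kmeansModel = KMeans.train(vectors, k = 5, maxIterations = 20)
    println(s"Within-set sum of squared errors: ${kmeansModel.computeCost(vectors)}")

    // SVD: a lower-dimensional representation of the same data.
    val svd = new RowMatrix(vectors).computeSVD(k = 3, computeU = true)
    println(s"Top singular values: ${svd.s}")

    sc.stop()
  }
}
```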

Multinomial Logistic Regression with Apache Spark (Mon, 05 May 2014)
/dbtsai/2014-0501-mlor
Logistic regression can be used for modeling not only binary outcomes but also, with some extension, multinomial outcomes. In this talk, DB will explain the basic idea of binary logistic regression step by step and then extend it to the multinomial case. He will show how easy it is with Spark to parallelize this iterative algorithm by utilizing the in-memory RDD cache to scale horizontally (in the number of training samples). However, there is a mathematical limitation on scaling vertically (in the number of training features), while many recent applications from document classification and computational linguistics are of this type. He will talk about how to address this problem with the L-BFGS optimizer instead of the Newton optimizer.

Bio: DB Tsai is a machine learning engineer working at Alpine Data Labs. He has recently been working with the Spark MLlib team to add support for the L-BFGS optimizer and multinomial logistic regression upstream. He also led the Apache Spark development at Alpine Data Labs. Before joining Alpine Data Labs, he worked on large-scale optimization of optical quantum circuits at Stanford as a PhD student.
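Since this deck covers the same material as the 2014-06-20 entry above, here is a complementary sketch from the other angle it mentions: driving the optimization directly with MLlib's L-BFGS implementation and a logistic gradient, rather than through the higher-level classifier. The LIBSVM path and all hyperparameter values are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
import org.apache.spark.mllib.util.MLUtils

object LBFGSSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lbfgs-sketch"))

    // Placeholder binary-labeled dataset in LIBSVM format.
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///sample_binary_data.txt")
    val numFeatures = data.first().features.size

    // Append a bias (intercept) term and convert to (label, features) pairs.
    val training = data
      .map(p => (p.label, MLUtils.appendBias(p.features)))
      .cache()

    // Run L-BFGS with a logistic loss gradient and L2 regularization.
    val initialWeights = Vectors.dense(new Array[Double](numFeatures + 1))
    val (weights, lossHistory) = LBFGS.runLBFGS(
      training,
      new LogisticGradient(),
      new SquaredL2Updater(),
      10,    // number of corrections kept in the L-BFGS history
      1e-4,  // convergence tolerance
      100,   // max iterations
      0.1,   // regularization parameter
      initialWeights)

    println(s"Final loss: ${lossHistory.last}, weights: $weights")
    sc.stop()
  }
}
```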

Big Data Machine Learning Engineer with a strong computer science, theoretical physics, and mathematical background. I have a deep understanding of implementing data mining algorithms in scalable ways, not just using them as a consumer. I'm a big fan of Scala and have been using it to develop scalable, distributed data mining algorithms with Apache Spark. I've been involved in open source Apache Spark development as a contributor. Apache Spark is a fast and general engine for large-scale data processing, and it fits into the Hadoop open-source ecosystem.

Specialties:
• Machine Learning and Data Mining.
• Distributed/Parallel Computing and Big Data Processing.
• Expert in Apache Hadoop.

www.dbtsai.com