Slideshows by User: goswamianjan / SlideShare feed for Slideshows by User: goswamianjan / Thu, 25 Jul 2019 05:56:23 GMT
Learning to Diversify for E-commerce Search with Multi-Armed Bandit /slideshow/diverse-ecom2019/157745718 diverseecom2019-190725055623
Search is central to e-commerce platforms. Diversifying search results is essential to cater to customers' diverse preferences. One of the primary metrics of an e-commerce business is revenue, and the prices of the products shown influence customer preferences. Hence, diversifying e-commerce search results requires learning customers' diverse price preferences while simultaneously maximizing revenue without hurting the relevance of the results. In this paper, we introduce the learning-to-diversify problem for e-commerce search. We also show that diversification improves the median customer lifetime value (CLV), a critical long-term metric for an e-commerce business. We design three algorithms for the task. The first two are modifications of algorithms previously developed for the diversification problem in web search. The third is a novel approximate knapsack-based semi-bandit algorithm. We derive regret and pay-off bounds for all three algorithms and conduct experiments with synthetic data and simulation to validate and compare them. In our simulation, we compute revenue, median CLV, and purchase-based mean reciprocal rank (PMRR) under various scenarios, such as user preferences that change over time. We show that the proposed third algorithm is more practical and efficient than the first two and can produce higher revenue while maintaining better median CLV and PMRR.
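The abstract names PMRR without defining it. As a rough illustration only, the sketch below assumes PMRR is the ordinary mean reciprocal rank computed over the 1-based position of the first purchased item in each search session, with a session that ends without a purchase contributing 0. The function name `purchase_mrr` and the use of `None` for "no purchase" are hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def purchase_mrr(purchase_ranks: Sequence[Optional[int]]) -> float:
    """Mean reciprocal rank over sessions.

    Each entry is the 1-based rank of the first purchased item in a
    session, or None if nothing was purchased (assumed to contribute 0).
    """
    if not purchase_ranks:
        return 0.0
    total = 0.0
    for rank in purchase_ranks:
        if rank is None:
            continue  # no purchase: reciprocal rank taken as 0
        if rank < 1:
            raise ValueError("ranks are 1-based")
        total += 1.0 / rank
    return total / len(purchase_ranks)


# Four sessions: purchases at ranks 1, 2 and 4, and one with no purchase.
score = purchase_mrr([1, 2, None, 4])  # (1 + 0.5 + 0 + 0.25) / 4 = 0.4375
```

Under this assumed definition, a ranking that moves purchased items toward the top of the results raises the score, which is why it can sit alongside revenue and median CLV when comparing the algorithms.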

Thu, 25 Jul 2019 05:56:23 GMT /slideshow/diverse-ecom2019/157745718 goswamianjan@slideshare.net(goswamianjan)
Learning to Diversify for E-commerce Search with Multi-Armed Bandit from Anjan Goswami
312 2 https://cdn.slidesharecdn.com/ss_thumbnails/diverseecom2019-190725055623-thumbnail.jpg?width=120&height=120&fit=bounds presentation Black http://activitystrea.ms/schema/1.0/post http://activitystrea.ms/schema/1.0/posted 0
Discovery In Commerce Search /slideshow/discovery-in-commerce-search/112058796 bigdata0818-180829052055
Ranking in commerce search has several unique challenges. One specific challenge for a commerce search engine is maintaining a discovery mechanism that does not significantly hurt its revenue, sales, or customer experience. In this talk, I will discuss a few practical and theoretically sound algorithms for discovery in a commerce search engine that fit cleanly into an existing learning to rank (LTR) framework.

Wed, 29 Aug 2018 05:20:55 GMT /slideshow/discovery-in-commerce-search/112058796 goswamianjan@slideshare.net(goswamianjan)
Discovery In Commerce Search from Anjan Goswami
173 2 https://cdn.slidesharecdn.com/ss_thumbnails/bigdata0818-180829052055-thumbnail.jpg?width=120&height=120&fit=bounds presentation Black http://activitystrea.ms/schema/1.0/post http://activitystrea.ms/schema/1.0/posted 0
Machine-Learned Ranking Algorithms for E-commerce Search and Recommendation Applications /slideshow/machinelearned-ranking-algorithms-for-ecommerce-search-and-recommendation-applications/104031575 standaloneabstract-180703051306
Abstract

Tue, 03 Jul 2018 05:13:06 GMT /slideshow/machinelearned-ranking-algorithms-for-ecommerce-search-and-recommendation-applications/104031575 goswamianjan@slideshare.net(goswamianjan)
Machine-Learned Ranking Algorithms for E-commerce Search and Recommendation Applications from Anjan Goswami
574 5 https://cdn.slidesharecdn.com/ss_thumbnails/standaloneabstract-180703051306-thumbnail.jpg?width=120&height=120&fit=bounds document Black http://activitystrea.ms/schema/1.0/post http://activitystrea.ms/schema/1.0/posted 0
Controlled Experiments for Decision-Making in e-Commerce Search /slideshow/controlled-experiments-for-decisionmaking-in-ecommerce-search/59773682 ieeebigdataslides-160319211906
Slides from IEEE Big Data presentation

Sat, 19 Mar 2016 21:19:06 GMT /slideshow/controlled-experiments-for-decisionmaking-in-ecommerce-search/59773682 goswamianjan@slideshare.net(goswamianjan)
Controlled Experiments for Decision-Making in e-Commerce Search from Anjan Goswami
544 4 https://cdn.slidesharecdn.com/ss_thumbnails/ieeebigdataslides-160319211906-thumbnail.jpg?width=120&height=120&fit=bounds presentation Black http://activitystrea.ms/schema/1.0/post http://activitystrea.ms/schema/1.0/posted 0
Spelling correction systems for e-commerce platforms /slideshow/spelling-correction-systems-for-ecommerce-platforms/59660708 spellnew-160317033138
This is a presentation on building a scalable machine-learned spelling correction system for an e-commerce site. Most of the techniques are also applicable to any large consumer site.

Thu, 17 Mar 2016 03:31:38 GMT /slideshow/spelling-correction-systems-for-ecommerce-platforms/59660708 goswamianjan@slideshare.net(goswamianjan)
Spelling correction systems for e-commerce platforms from Anjan Goswami
1678 7 https://cdn.slidesharecdn.com/ss_thumbnails/spellnew-160317033138-thumbnail.jpg?width=120&height=120&fit=bounds presentation Black http://activitystrea.ms/schema/1.0/post http://activitystrea.ms/schema/1.0/posted 0
Reputation systems /slideshow/reputation-systems/59654150 reputation-systems-160316220852
A brief survey of gaming in reputation systems.

Wed, 16 Mar 2016 22:08:52 GMT /slideshow/reputation-systems/59654150 goswamianjan@slideshare.net(goswamianjan)
Reputation systems from Anjan Goswami
549 5 https://cdn.slidesharecdn.com/ss_thumbnails/reputation-systems-160316220852-thumbnail.jpg?width=120&height=120&fit=bounds presentation 000000 http://activitystrea.ms/schema/1.0/post http://activitystrea.ms/schema/1.0/posted 0
Topic Models Based Understanding of Supply and Demand Side of an eCommerce Engine /goswamianjan/topic-models-based-understanding-of-supply-and-demand-side-of-an-ecommerce-engine supplydemandanalysisv2-150618184635-lva1-app6891
The goal of an e-commerce business is to connect demand with supply. A deeper insight into this connection is thus necessary to make investment decisions on SEO, marketing, assortment choices, and technology. Topic models are unsupervised machine learning techniques for mining large corpora that can uncover the underlying semantic structure of the data. These models have been successfully applied to various types of data, such as text, images, and biological data. In this talk, I will discuss how topic models can be applied to systematically understand the supply and demand of an e-commerce engine.

Thu, 18 Jun 2015 18:46:34 GMT /goswamianjan/topic-models-based-understanding-of-supply-and-demand-side-of-an-ecommerce-engine goswamianjan@slideshare.net(goswamianjan)
Topic Models Based Understanding of Supply and Demand Side of an eCommerce Engine from Anjan Goswami
598 11 https://cdn.slidesharecdn.com/ss_thumbnails/supplydemandanalysisv2-150618184635-lva1-app6891-thumbnail.jpg?width=120&height=120&fit=bounds presentation 000000 http://activitystrea.ms/schema/1.0/post http://activitystrea.ms/schema/1.0/posted 0
Assessing product image quality for online shopping /slideshow/spieimagequality/49396637 spieimagequality-150615090444-lva1-app6892
Assessing product-image quality is important in the context of online shopping. A high-quality image that conveys more information about a product can boost the buyer’s confidence and attract more attention. However, the notion of image quality for product images is not the same as in other domains. The perceived quality of a product image depends not only on photographic quality features but also on high-level features such as the clarity of the foreground or the suitability of the background. In this paper, we define a notion of product-image quality based on such features. We conduct a crowdsourced experiment to collect user judgments on thousands of eBay’s images. We formulate a multi-class classification problem for modeling image quality, classifying images as good, fair, or poor based on the guided perceptual notions from the judges. We also conduct regression experiments using the average crowdsourced human judgment as the target. We compute a pseudo-regression score as the expected average of the predicted classes, and we also compute a score from the regression technique. We design many experiments with various sampling and voting schemes on the crowdsourced data and construct various experimental image quality models. Most of our models have reasonable accuracy (at least 70%) on the test data set. We observe that our computed image quality score has a high (0.66) rank correlation with the average votes from the crowdsourced human judgments.
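One way to read the "pseudo-regression score with expected average of predicted classes" is as a probability-weighted average of numeric class values. The sketch below works under that reading. The numeric coding (poor = 0, fair = 1, good = 2), the function name, and the example probabilities are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import Sequence


def pseudo_regression_score(
    class_probs: Sequence[float],
    class_values: Sequence[float] = (0.0, 1.0, 2.0),  # assumed: poor, fair, good
) -> float:
    """Expected class value under a classifier's predicted probabilities."""
    if len(class_probs) != len(class_values):
        raise ValueError("one probability is needed per class")
    total = sum(class_probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    weighted = sum(p * v for p, v in zip(class_probs, class_values))
    return weighted / total  # normalise in case the inputs do not sum to 1


# A hypothetical image predicted as 10% poor, 30% fair, 60% good.
score = pseudo_regression_score([0.1, 0.3, 0.6])
```

A continuous score of this kind can be ranked against the average crowd vote, which is the sort of comparison a rank correlation measures.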

Mon, 15 Jun 2015 09:04:44 GMT /slideshow/spieimagequality/49396637 goswamianjan@slideshare.net(goswamianjan)
Assessing product image quality for online shopping from Anjan Goswami
1527 7 https://cdn.slidesharecdn.com/ss_thumbnails/spieimagequality-150615090444-lva1-app6892-thumbnail.jpg?width=120&height=120&fit=bounds presentation 000000 http://activitystrea.ms/schema/1.0/post http://activitystrea.ms/schema/1.0/posted 0
Clustering /slideshow/clustering-48926800/48926800 clustering-150603063809-lva1-app6892
Clustering has been one of the most widely studied topics in data mining, and k-means has been one of the most popular clustering algorithms. K-means requires several passes over the entire dataset, which can make it very expensive for large disk-resident datasets. In view of this, a lot of work has been done on approximate versions of k-means that require only one or a small number of passes over the entire dataset. In this paper, we present a new algorithm, called Fast and Exact K-means Clustering (FEKM), which typically requires only one or a small number of passes over the entire dataset and provably produces the same cluster centers as the original k-means algorithm. The algorithm uses sampling to create initial cluster centers and then takes one or more passes over the entire dataset to adjust these cluster centers. We provide theoretical analysis to show that the cluster centers thus reported are the same as the ones computed by the original k-means algorithm. Experimental results on a number of real and synthetic datasets show a speedup of between 2 and 4.5 times compared to k-means. This paper also describes and evaluates a distributed version of FEKM, which we refer to as DFEKM. This algorithm is suitable for analyzing data that is distributed across loosely coupled machines. Unlike previous work in this area, DFEKM provably produces the same results as the original k-means algorithm. Our experimental results show that DFEKM is clearly better than two other possible options for exact clustering on distributed data: downloading all data and running sequential k-means, or running parallel k-means on a loosely coupled configuration. Moreover, even in a tightly coupled environment, DFEKM can outperform parallel k-means if there is a significant load imbalance.
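The two-phase idea in the abstract (centers estimated on a sample, then adjusted by passes over the full dataset) can be sketched roughly as follows. This is plain Lloyd-style k-means run first on a random sample and then on all points; it is not FEKM itself and carries none of its exactness guarantees or pass-count savings. The function names, the `sample_size` and `seed` parameters, and the toy data are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def _sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points: Sequence[Point]) -> Point:
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def _lloyd(points: Sequence[Point], centers: List[Point], max_passes: int) -> List[Point]:
    """Standard k-means iterations; each loop is one pass over `points`."""
    for _ in range(max_passes):
        clusters: List[List[Point]] = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: _sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        updated = [_mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers


def kmeans_from_sample(points: Sequence[Point], k: int, sample_size: int,
                       max_passes: int = 20, seed: int = 0) -> List[Point]:
    """Phase 1: k-means on a random sample. Phase 2: refine on all points."""
    rng = random.Random(seed)
    sample = rng.sample(list(points), min(sample_size, len(points)))
    if k > len(sample):
        raise ValueError("sample must contain at least k points")
    initial = _lloyd(sample, list(sample[:k]), max_passes)
    return _lloyd(points, initial, max_passes)


# Toy data: two well-separated groups of four points each.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers = sorted(kmeans_from_sample(data, k=2, sample_size=4))
```

When the sample is representative, the first phase tends to leave the centers close to their final positions, so the second phase needs few passes over the full data; the abstract's stronger claim of matching the original k-means result exactly depends on analysis that this sketch does not reproduce.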

Wed, 03 Jun 2015 06:38:09 GMT /slideshow/clustering-48926800/48926800 goswamianjan@slideshare.net(goswamianjan)
Clustering from Anjan Goswami
540 4 https://cdn.slidesharecdn.com/ss_thumbnails/clustering-150603063809-lva1-app6892-thumbnail.jpg?width=120&height=120&fit=bounds presentation 000000 http://activitystrea.ms/schema/1.0/post http://activitystrea.ms/schema/1.0/posted 0
Career Highlights:
- 15+ years of experience in engineering, research, and product development.
- 9+ years of experience in advancing business using science and data.
- 9+ years of experience in search technologies, with emphasis on ranking and applied sciences.
- 3+ years of experience in building and leading high-performance teams and delivering high-impact applied-science-based products.
- 7+ years of experience in managing international teams and technology development in Asian countries.
- 5+ years of experience in managing academic collaborations with leading universities.
- PhD in Computer Science (expected in 2015) with specialization in machine learning and information retrieval....
http://finaleanalyst.blogspot.com