Collaborative Filtering Recommendation Algorithm based on Hadoop
Tien-Yang (Aiden) Wu
This document outlines an item-based collaborative filtering recommendation algorithm that has been scaled up to run on Hadoop. It first discusses collaborative filtering techniques and how they work. It then describes scaling up the item-based collaborative filtering approach by dividing it into two steps: similarity computation and prediction/recommendation. The key computations involve calculating average item ratings, similarity between item pairs, and predicted ratings for target users. An experiment tested the scaled approach on a Hadoop cluster with 3 nodes.
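To make the two-step split concrete, here is a minimal Hadoop Streaming sketch of the first step (similarity computation): group each user's ratings together, then emit every co-rated item pair so a later job can aggregate them into item-item similarities. The "user,item,rating" input format, the script layout, and the pairing strategy are illustrative assumptions, not the implementation from the original document.

```python
#!/usr/bin/env python3
# Hypothetical Hadoop Streaming sketch: group ratings by user, emit co-rated
# item pairs. A second job (not shown) would group by item pair and compute
# the pairwise similarity used for prediction.
import sys
from itertools import combinations

def mapper():
    # key each rating by user so one reducer sees all items that user rated
    for line in sys.stdin:
        user, item, rating = line.strip().split(",")
        print(f"{user}\t{item}:{rating}")

def reducer():
    # for each user, emit (itemA,itemB) -> (ratingA,ratingB)
    def flush(items):
        for (i1, r1), (i2, r2) in combinations(sorted(items), 2):
            print(f"{i1},{i2}\t{r1},{r2}")

    current_user, items = None, []
    for line in sys.stdin:
        user, item_rating = line.strip().split("\t")
        if current_user is not None and user != current_user:
            flush(items)
            items = []
        current_user = user
        item, rating = item_rating.split(":")
        items.append((item, rating))
    if items:
        flush(items)

if __name__ == "__main__":
    # run as "-mapper 'script.py map'" and "-reducer 'script.py reduce'"
    mapper() if sys.argv[1:] == ["map"] else reducer()
```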
Scalable sentiment classification for big data analysis using Naive Bayes classifier
Tien-Yang (Aiden) Wu
The document discusses evaluating the scalability of the Naive Bayes classifier for sentiment analysis on large datasets. It presents the Naive Bayes classification method, which uses Bayes' theorem with independence assumptions between features. It then describes implementing Naive Bayes in Hadoop for sentiment classification of movie reviews at scale, including preprocessing data, calculating word frequencies, and predicting sentiment. An experimental study tested the implementation on a Hadoop cluster with over 1,000 positive and 1,000 negative reviews for training.
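The single-machine core of that approach is a word-frequency Naive Bayes model. The sketch below shows the idea with a toy training set; the tokenization, the add-one (Laplace) smoothing, and the example documents are assumptions for illustration, not the Hadoop implementation from the experiment.

```python
# Word-count Naive Bayes with Laplace smoothing (illustrative sketch).
import math
from collections import Counter

def train(docs_by_label):
    """docs_by_label: {"pos": [token lists], "neg": [token lists]}."""
    priors, word_counts, totals = {}, {}, {}
    n_docs = sum(len(d) for d in docs_by_label.values())
    vocab = {w for docs in docs_by_label.values() for doc in docs for w in doc}
    for label, docs in docs_by_label.items():
        priors[label] = len(docs) / n_docs
        word_counts[label] = Counter(w for doc in docs for w in doc)
        totals[label] = sum(word_counts[label].values())
    return priors, word_counts, totals, vocab

def predict(tokens, priors, word_counts, totals, vocab):
    scores = {}
    for label in priors:
        # log P(label) + sum of log P(word | label), add-one smoothed
        score = math.log(priors[label])
        for w in tokens:
            count = word_counts[label].get(w, 0)
            score += math.log((count + 1) / (totals[label] + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train({"pos": [["great", "movie"], ["great", "fun"]],
               "neg": [["boring", "plot"]]})
print(predict(["great", "plot"], *model))   # -> "pos"
```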
K-means is an unsupervised learning algorithm that clusters data by minimizing distances between data points and cluster centers. It works by:
1. Randomly selecting K data points as initial cluster centers
2. Calculating the distance between each data point and cluster center and assigning the point to the closest center
3. Re-calculating the cluster centers based on the current assignments
4. Repeating steps 2-3 until cluster centers stop moving or a maximum number of iterations is reached.
The number of clusters K must be specified beforehand, but the elbow method can help determine an appropriate value for K. Bisecting K-means is an alternative that starts with all data in one cluster and recursively splits clusters until K clusters are obtained. A minimal sketch of the standard procedure follows.
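The sketch below follows the four steps above (random initialization, assignment, center update, repeat until convergence). The NumPy implementation, tolerance value, and toy data are assumptions for illustration.

```python
import numpy as np

def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick k data points at random as the initial cluster centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # step 2: assign each point to its closest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: re-calculate each center as the mean of its assigned points
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # step 4: stop once the centers stop moving (or max_iter is reached)
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return centers, labels

data = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.1], [7.9, 8.3]])
centers, labels = kmeans(data, k=2)
print(labels)
```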
Collaborative filtering is a technique used in recommender systems to predict a user's preferences from the preferences of similar users. It involves collecting ratings or preference data from users, calculating similarities between users or items, and generating predictions for a user's unknown ratings as weighted averages of the ratings from similar users or items. There are two main types: user-based, which computes similarities between users, and item-based, which computes similarities between items. Challenges include the cold-start problem, data sparsity, scalability on large datasets, and reducing user bias.
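As a small, single-machine illustration of the weighted-average prediction just described, the sketch below uses user-based cosine similarity to estimate an unknown rating. The toy ratings matrix and the choice of plain cosine similarity are assumptions for illustration only.

```python
import math

ratings = {                       # user -> {item: rating}
    "alice": {"m1": 5, "m2": 3, "m3": 4},
    "bob":   {"m1": 4, "m2": 3, "m4": 5},
    "carol": {"m2": 1, "m3": 2, "m4": 4},
}

def cosine(u, v):
    # similarity over the items both users have rated
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def predict(user, item):
    # weighted average of other users' ratings for the item,
    # weighted by each user's similarity to the target user
    num = den = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        sim = cosine(ratings[user], their_ratings)
        num += sim * their_ratings[item]
        den += abs(sim)
    return num / den if den else None

print(predict("alice", "m4"))   # predicted rating for an item alice has not rated
```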