The document discusses Python programming and data science tools like NumPy, Scikit-learn, and Cython. It provides examples of using NumPy to quickly sum a large array and speed up a prime number calculation with Cython. It also briefly mentions past Python conference talks and techniques like spectral clustering and activation functions.
Fast and Provably Good Seedings for k-Means
Kimikazu Kato
The document proposes a new MCMC-based algorithm for initializing centroids in k-means clustering that does not assume a specific distribution of the input data, unlike previous work. It uses rejection sampling to emulate the distribution and select initial centroids that are widely scattered. The algorithm is proven mathematically to converge. Experimental results on synthetic and real-world datasets show it performs well with a good trade-off of accuracy and speed compared to existing techniques.
This document discusses Python and machine learning libraries like scikit-learn. It provides code examples for loading data, fitting models, and making predictions using scikit-learn algorithms. It also covers working with NumPy arrays and loading data from files like CSVs.
Introduction to behavior based recommendation system
Kimikazu Kato
Material presented at Tokyo Web Mining Meetup, March 26, 2016.
The source code is here:
https://github.com/hamukazu/tokyo.webmining.2016-03-26
These are the slides from the Tokyo Web Mining presentation. All slides are in English.
Recommendation System -- Theory and Practice
Kimikazu Kato
This document provides an overview of recommendation systems and collaborative filtering techniques. It discusses using matrix factorization to predict user ratings by representing users and items as vectors in a latent factor space. Optimization techniques like stochastic gradient descent can be used to learn the factorization from existing ratings. The document also notes challenges of sparsity and scale for practical systems and describes approaches like elastic net regularization and sparsification to address these.
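The matrix factorization idea summarized above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the ratings are made up, the function names `factorize_sgd` and `rmse` are hypothetical, and plain L2 regularization is used in place of the elastic net mentioned in the summary.

```python
import numpy as np

def factorize_sgd(triples, n_users, n_items, rank=2, lr=0.05, reg=0.02,
                  epochs=1000, seed=0):
    """Fit R ~ U @ V.T on observed (user, item, rating) triples with SGD."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, rank))  # user latent vectors
    V = 0.1 * rng.standard_normal((n_items, rank))  # item latent vectors
    for _ in range(epochs):
        # Visit the observed ratings in a fresh random order each epoch.
        for idx in rng.permutation(len(triples)):
            u, i, r = triples[idx]
            err = r - U[u] @ V[i]
            u_old = U[u].copy()  # keep the old value for the item update
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_old - reg * V[i])
    return U, V

def rmse(triples, U, V):
    """Root mean squared error over the observed ratings only."""
    errs = [r - U[u] @ V[i] for u, i, r in triples]
    return float(np.sqrt(np.mean(np.square(errs))))

# Made-up observed ratings (user index, item index, score from 1 to 5).
triples = [(0, 0, 5.0), (0, 1, 4.0), (0, 2, 1.0), (0, 3, 4.0), (1, 1, 4.0),
           (2, 0, 2.0), (2, 3, 3.0), (3, 0, 1.0), (3, 2, 4.0)]

U, V = factorize_sgd(triples, n_users=4, n_items=4)
print("training RMSE:", rmse(triples, U, V))
print("predicted score for user 3, item 3:", float(U[3] @ V[3]))
```

Only the observed triples enter the loss, so unobserved user-item pairs are never treated as ratings of zero.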
Effective Numerical Computation in NumPy and SciPyKimikazu Kato
This document provides an overview of effective numerical computation in NumPy and SciPy. It discusses how Python can be used for numerical computation tasks like differential equations, simulations, and machine learning. While Python is initially slower than languages like C, libraries like NumPy and SciPy allow Python code to achieve sufficient speed through techniques like broadcasting, indexing, and using sparse matrix representations. The document provides examples of how to efficiently perform tasks like applying functions element-wise to sparse matrices and calculating norms. It also presents a case study for efficiently computing a formula that appears in a machine learning paper using different sparse matrix representations in SciPy.
Kimikazu Kato is the Chief Scientist at Silver Egg Technology, which provides recommender system and online advertising services. He has a PhD in computer science and experience in areas like computer graphics and parallel computing. Silver Egg uses a real-time recommender platform called Aigent Suite to consistently target users from initial visits to retention. The system analyzes user behavior data to determine personalized recommendations and ad targeting. While collaborative filtering and matrix factorization are common recommendation algorithms, approaches need adjustments for sales recommendations versus movie ratings. Consulting is also important for tuning algorithm parameters to specific business needs.
12. Netflix Prize
The Netflix Prize was an open competition for the best collaborative filtering
algorithm to predict user ratings for films, based on previous ratings
— Wikipedia
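An open competition for collaborative filtering algorithms
Given the scores (1-5) that users gave to movies, predict unknown scores from past scores.
Netflix (a US DVD rental company) released part of its data for this purpose
Ended in 2009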
13. Movie score (rating) prediction
Suppose users give scores to the movies they have watched
movie
user W X Y Z
A 5 4 1 4
B 4
C 2 3
D 1 4 ?
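Can we predict the score for a user-movie pair whose score is unknown?
The ratings are represented as a sparse matrix
A zero entry of the matrix does not actually mean "zero"; it means "unknown"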
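A small NumPy sketch may help show why a zero entry has to be read as "unknown". The values below reuse the scores from the table on this slide, but the column positions for users B to D are assumptions, because the layout of the original table is not recoverable here. Only the known entries are stored, as coordinate-style (row, column, value) triples.

```python
import numpy as np

# Known ratings only, stored as coordinate-style triples.
# Column positions for users B, C and D are assumed for illustration.
rows = np.array([0, 0, 0, 0, 1, 2, 2, 3, 3])   # user index (A=0 ... D=3)
cols = np.array([0, 1, 2, 3, 1, 0, 3, 0, 2])   # movie index (W=0 ... Z=3)
vals = np.array([5, 4, 1, 4, 4, 2, 3, 1, 4], dtype=float)

n_users, n_movies = 4, 4

# Dense view in which unknown entries show up as 0.
dense = np.zeros((n_users, n_movies))
dense[rows, cols] = vals

# Boolean mask that records which entries are actually known.
known = np.zeros((n_users, n_movies), dtype=bool)
known[rows, cols] = True

naive_mean = dense.mean()            # wrongly counts unknowns as a score of 0
observed_mean = dense[known].mean()  # averages the known scores only

print("mean treating unknowns as zero:", naive_mean)
print("mean over known scores only:", observed_mean)
```

The two averages differ because the first one silently treats every missing rating as the lowest possible score.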
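16. Ratings vs. purchases

Movie scores
movie
user W X Y Z
A 5 4 1 4
B 4
C 2 3
D 1 4 ?
Predict the score for a user-movie pair.
This matrix contains negative information
(for example, someone has rated a certain movie as "boring" (1 point)).

Shopping recommendation
item
user W X Y Z
A 1 1 1 1
B 1
C 1
D 1 1 ?
Predict how likely the user is to buy the item.
This matrix contains no negative information
(for an unknown entry, the reason the user did not buy is unknown)
→ predictions are pulled strongly toward 1.
The known entries take only one kind of value
→ this leaves a high degree of freedom.

A method that works well for movie score prediction does not necessarily carry over directly to recommendation on a shopping site.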
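The "pulled toward 1" effect can be seen with a tiny NumPy example. It is a sketch on a made-up purchase log, with a hypothetical helper `observed_loss`: when every known entry equals 1, a rank-1 model whose factors are all ones already fits the observed data perfectly, yet it predicts 1 for every unobserved pair and so carries no ranking information.

```python
import numpy as np

# Made-up purchase log: (user, item) pairs that were bought.
purchases = [(0, 0), (0, 1), (0, 2), (0, 3), (1, 1), (2, 0), (3, 0), (3, 2)]
n_users, n_items = 4, 4

def observed_loss(U, V, pairs):
    """Squared error over observed purchases only (the target is always 1)."""
    return sum((1.0 - U[u] @ V[i]) ** 2 for u, i in pairs)

# Rank-1 model in which every latent factor is 1.
U = np.ones((n_users, 1))
V = np.ones((n_items, 1))

loss = observed_loss(U, V, purchases)
scores = U @ V.T  # predicted score for every user-item pair

print("loss on observed purchases:", loss)
print(scores)
```

A loss that looks only at the observed entries cannot tell this trivial model apart from a useful one, which is why the unknown entries need some additional treatment.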
17. Solutions
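• Treat the initial zero entries (= unknown entries) as zeros rather than as unknowns
– A stopgap, but it works better than applying PMF (probabilistic matrix factorization) as-is
• Treat a certain fraction of the initial zero entries as remaining zero after optimization [Sindhwani et al. 2010]
– Assign new variables to the initial zero entries and solve a relaxed problem
– Experimentally, this works better than the approach above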
V. Sindhwani et al., One-Class Matrix Completion with Low-Density Factorizations. In Proc. of ICDM 2010: 1055-1060
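The first option, treating unknown entries as zeros, can be sketched as follows. This is not the slide's method in detail: a truncated SVD is swapped in as a simple stand-in for a low-rank factorization, the purchase data is made up, and the function name `recommend_by_low_rank` is hypothetical. The relaxation approach of Sindhwani et al. is not implemented here.

```python
import numpy as np

def recommend_by_low_rank(X, k):
    """Rank-k reconstruction of a 0/1 purchase matrix, unknowns taken as 0.

    Returns the score matrix and, for each user, the index of the
    highest-scoring item that the user has not bought yet.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = (U[:, :k] * s[:k]) @ Vt[:k]             # keep the top-k components
    candidates = np.where(X > 0, -np.inf, scores)    # hide items already bought
    return scores, candidates.argmax(axis=1)

# Made-up purchase log; every user has at least one item not yet bought.
purchases = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 0), (2, 1), (2, 3),
             (3, 2), (3, 3)]
n_users, n_items = 4, 4

X = np.zeros((n_users, n_items))
for u, i in purchases:
    X[u, i] = 1.0

scores, best_item = recommend_by_low_rank(X, k=2)
print(np.round(scores, 2))
print("recommended item per user:", best_item)
```

Because the zeros are fitted as real targets, the reconstruction is no longer free to predict 1 everywhere, and the scores on unbought items can be used for ranking.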