DMM.COM SPARK
2015/4 - DMM labo
API
AGENDA
DMM
Apache Spark
DMM
Tips
DMM
DMM
Why Did DMM.com Labo Adopt Spark? A Behind-the-Scenes Look at Recommendation Engine Development
SPARK
UC Berkeley, Apache
Scala, Python, Java, SQL, R API
(2014/09)
Mahout
Spark
Java, Scala, Python
GraphLab
WHY SPARK
MLlib, GraphX
Hadoop
Hadoop
item to item
user to item
popular
1. (Tracking API)
2. (Hive on Spark)
3. (Spark)
4. (Sqoop)
5. API(Play)
(TRACKING API)
JavaScript
API
RDB Hadoop
(HIVE ON SPARK)
Spark
(SPARK - ITEM2ITEM)
val itemToItems = userProducts.join(userProducts).filter {
case (user, ((item1, keyword1, score1), (item2, keyword2, score2))) => item1
}.map {
case (user, ((item1, keyword1, score1), (item2, keyword2, score2))) => ((item
}.reduceByKey(_ + _).mapValues(math.sqrt(_)).map {
case ((item1, keyword1, item2), score) => ((item1, keyword1), (item2, score))
}.groupByKey().mapValues(_.toList.sortBy(_._2).reverse.take(config.numDisplayIt
case ((item1, keyword1), items) => items.size >= config.numDisplayItems
}.cache()
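The snippet above is cut off at the slide's edge, so the filter condition and the per-pair score cannot be read from it. The following is a minimal sketch of the same join, aggregate, and top-N shape using plain Scala collections instead of RDDs. The filter (different items, same keyword) and the pair score (product of both scores) are assumptions, and `ItemToItemSketch` and its parameter names are hypothetical.

```scala
object ItemToItemSketch {
  // (user, (item, keyword, score)), mirroring the element shape of userProducts
  type UserProduct = (String, (String, String, Double))

  def itemToItems(
      userProducts: Seq[UserProduct],
      numDisplayItems: Int
  ): Map[(String, String), List[(String, Double)]] = {
    // Self-join on user: every pair of rows that share a user.
    val joined = for {
      (user1, left) <- userProducts
      (user2, right) <- userProducts
      if user1 == user2
    } yield (left, right)

    joined
      // Assumption: keep pairs of different items under the same keyword.
      .filter { case ((item1, keyword1, _), (item2, keyword2, _)) =>
        item1 != item2 && keyword1 == keyword2
      }
      // Assumption: a pair contributes the product of both scores.
      .map { case ((item1, keyword1, score1), (item2, _, score2)) =>
        ((item1, keyword1, item2), score1 * score2)
      }
      // Collection equivalent of reduceByKey(_ + _).mapValues(math.sqrt(_)).
      .groupBy(_._1)
      .map { case (key, rows) => (key, math.sqrt(rows.map(_._2).sum)) }
      .toSeq
      // Re-key by (item1, keyword1) and keep the highest-scoring related items.
      .map { case ((item1, keyword1, item2), score) =>
        ((item1, keyword1), (item2, score))
      }
      .groupBy(_._1)
      .map { case (key, rows) =>
        (key, rows.map(_._2).toList.sortBy(_._2).reverse.take(numDisplayItems))
      }
      // Drop items that cannot fill every display slot.
      .filter { case (_, items) => items.size >= numDisplayItems }
  }
}
```

On a cluster the same steps map onto `join`, `reduceByKey`, and `groupByKey` as in the slide; the collection version is only meant to make the data flow easy to follow.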
(SPARK - USER2ITEM)
MLlib ALS (alternating least squares)
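import org.apache.spark.mllib.recommendation.ALS

// Train a matrix-factorization model with ALS
val model = ALS.train(ratings.map(_._1), config.alsRank,
  config.alsNumIterations, config.alsLambda)
// Score the candidate (user, item) pairs, then keep the top-rated items per user
val predictions = model.predict(candidates).groupBy(_.user).map {
  case (user, ratings) =>
    (user, ratings.toList.sortBy(_.rating)
      .reverse.take(config.numDisplayItems))
}.cache()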
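The post-processing after `model.predict` is a per-user top-N selection. A minimal sketch of just that step on plain Scala collections follows. The local `Rating` case class is a stand-in with the same fields as MLlib's `Rating`, and `TopNSketch` and `topNPerUser` are hypothetical names.

```scala
object TopNSketch {
  // Stand-in with the same fields as MLlib's Rating(user, product, rating).
  final case class Rating(user: Int, product: Int, rating: Double)

  // Group predictions by user and keep the n highest-rated items for each.
  def topNPerUser(predictions: Seq[Rating], n: Int): Map[Int, List[Rating]] =
    predictions
      .groupBy(_.user)
      .map { case (user, ratings) =>
        // Sorting by the negated rating gives descending order in one pass.
        (user, ratings.toList.sortBy(r => -r.rating).take(n))
      }
}
```

Sorting on the negated rating avoids the separate `reverse`; the two forms can order tied ratings differently, which rarely matters for a recommendation list.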
(SPARK)
RDB Hadoop
Sqoop MariaDB
API
item2item(id: ItemId): List[ItemId]
user2item(id: UserId): List[ItemId]
popular : List[ItemId]
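The three signatures above can be read as one small interface. Here is a minimal sketch, assuming `ItemId` and `UserId` are numeric IDs and using in-memory maps in place of the MariaDB tables that Sqoop exports into. `RecommendApiSketch`, `InMemoryRecommendApi`, and the fallback to the popular list for unknown IDs are all hypothetical.

```scala
object RecommendApiSketch {
  // Assumption: IDs are numeric; the slides do not show the actual types.
  type ItemId = Long
  type UserId = Long

  // The three endpoints listed on the slide.
  trait RecommendApi {
    def item2item(id: ItemId): List[ItemId]
    def user2item(id: UserId): List[ItemId]
    def popular: List[ItemId]
  }

  // In-memory stand-in for the precomputed recommendation tables.
  final class InMemoryRecommendApi(
      itemToItems: Map[ItemId, List[ItemId]],
      userToItems: Map[UserId, List[ItemId]],
      popularItems: List[ItemId]
  ) extends RecommendApi {
    // Hypothetical design choice: unknown IDs fall back to the popular list.
    def item2item(id: ItemId): List[ItemId] =
      itemToItems.getOrElse(id, popularItems)

    def user2item(id: UserId): List[ItemId] =
      userToItems.getOrElse(id, popularItems)

    def popular: List[ItemId] = popularItems
  }
}
```

Because the batch jobs precompute every list, each endpoint reduces to a key lookup, which keeps the serving layer simple.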
DEPLOY AND EXECUTE
Jenkins + Build Pipeline + BuildFlow
(2015/09)
Jenkins + Build Pipeline + BuildFlow
Job Script + Git
Hive
Spark
Sqoop
Recommend API(Node.js)
MariaDB(Galera Cluster)
Jenkins + Build Pipeline + BuildFlow
Job Script + Management API
Hive on Spark
Spark
Sqoop
Recommend API(Play)
MariaDB(Galera Cluster)
Management API
File
Hive on Spark
Hive 3
Play
Spark, Hive UDF Util
AB PDCA
701
75% ↑
97% ↑
TIPS
Use DataFrames or Datasets
hive
executor
memoryOverhead
cheat sheet
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
HIVE
Spark
HiveContext
Hive on Spark
DATAFRAMES DATASETS
(1.3 - ) DataFrames
(1.6 - ) Datasets
Project Tungsten(1.5 - )
Realtime Recommend
DataFrames & Datasets
GraphFrames
