- Apache Spark is an open-source cluster computing framework for large-scale data processing. It was originally developed at the University of California, Berkeley in 2009 and is used for distributed tasks like data mining, streaming and machine learning.
- Spark uses in-memory computing to improve performance. It keeps data in memory across tasks, which makes analytics faster than disk-based processing, and it can cache datasets in memory to speed up repeated computations.
- Configuring Spark's memory options correctly is important for avoiding out-of-memory errors. Options such as the storage fraction, the execution fraction, the on-heap memory size and the off-heap memory size control how Spark allocates and uses memory on each executor.
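As a rough illustration of how these options interact, the sketch below estimates the on-heap regions of a single executor. It assumes Spark's unified memory manager (Spark 1.6 and later), a fixed 300 MB reservation, and the defaults `spark.memory.fraction=0.6` and `spark.memory.storageFraction=0.5`; the function name `unified_memory_regions` is hypothetical, and real executors report slightly less usable heap than the configured `spark.executor.memory`.

```python
RESERVED_MB = 300  # assumption: heap Spark sets aside before applying spark.memory.fraction


def unified_memory_regions(executor_heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Estimate on-heap memory regions (in MB) for one executor.

    executor_heap_mb  ~ spark.executor.memory
    memory_fraction   ~ spark.memory.fraction
    storage_fraction  ~ spark.memory.storageFraction
    """
    if executor_heap_mb <= RESERVED_MB:
        raise ValueError("executor heap must exceed the reserved memory")
    usable = executor_heap_mb - RESERVED_MB
    unified = usable * memory_fraction      # shared by execution and storage
    storage = unified * storage_fraction    # cached blocks protected from eviction
    execution = unified - storage           # shuffles, joins, sorts, aggregations
    user = usable - unified                 # user data structures and metadata
    return {
        "unified_mb": unified,
        "storage_mb": storage,
        "execution_mb": execution,
        "user_mb": user,
    }


if __name__ == "__main__":
    for name, size_mb in unified_memory_regions(4096).items():
        print(f"{name}: {size_mb:.1f}")
```

The boundary between storage and execution is soft in the unified model, so the numbers are a starting point for tuning rather than hard limits.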
This document discusses exactly-once semantics in Apache Kafka 0.11. It gives an overview of how Kafka achieves exactly-once delivery between producers and consumers. Key points include:
- Kafka 0.11 introduced exactly-once semantics through changes that support transactions and deduplication.
- Producers can write in a transactional fashion and receive acknowledgments of committed writes from brokers.
- Brokers store commit markers to track the progress of transactions and ensure no data loss during failures.
- Consumers can read from brokers in a transactional mode and receive data only from committed transactions, guaranteeing no duplication of records.
- This allows reliable message delivery semantics between producers and consumers with Kafka acting as
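The deduplication and committed-read behavior described in the bullets above can be illustrated with a toy, in-memory model. This is a sketch under stated assumptions, not Kafka's implementation: the class `ToyPartitionLog` and its methods are hypothetical, and real brokers also track producer epochs, write commit and abort markers into the log, and reject out-of-order sequence numbers. In a real deployment the corresponding switches are the producer settings `enable.idempotence` and `transactional.id` and the consumer setting `isolation.level=read_committed`.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPartitionLog:
    """Toy stand-in for one partition log (hypothetical, for illustration only)."""
    records: list = field(default_factory=list)    # (txn_id, value) pairs in append order
    last_seq: dict = field(default_factory=dict)   # producer_id -> highest sequence seen
    committed: set = field(default_factory=set)    # ids of committed transactions

    def append(self, producer_id, seq, txn_id, value):
        """Append a record unless this producer already wrote this sequence number."""
        if seq <= self.last_seq.get(producer_id, -1):
            return False  # duplicate caused by a retry: drop it
        self.last_seq[producer_id] = seq
        self.records.append((txn_id, value))
        return True

    def commit(self, txn_id):
        """Mark a transaction as committed."""
        self.committed.add(txn_id)

    def read(self, read_committed=True):
        """Return values, optionally restricted to committed transactions."""
        return [value for txn_id, value in self.records
                if not read_committed or txn_id in self.committed]


if __name__ == "__main__":
    log = ToyPartitionLog()
    log.append("producer-1", 0, "txn-1", "a")
    log.append("producer-1", 0, "txn-1", "a")  # retried send, ignored
    log.append("producer-1", 1, "txn-1", "b")
    log.append("producer-1", 2, "txn-2", "c")  # txn-2 is never committed
    log.commit("txn-1")
    print(log.read())
    print(log.read(read_committed=False))
```

The retried send is dropped because its sequence number was already seen, and the committed-only read skips the record from the open transaction.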
A report on Strata + Hadoop World 2014, presented at Cloudera World Tokyo 2014. It covers the keynote by Cloudera chairman Mike Olson, an insurance company case study, social graph construction, ETL challenges, the HBase architecture, and more.
This document discusses the application of PostgreSQL in a large social infrastructure project involving smart meter management. It describes three main missions: (1) loading 10 million datasets within 10 minutes, (2) retaining data for 24 months, and (3) stabilizing performance for large-scale SELECT statements. Various optimizations are discussed to achieve these missions, including data modeling, performance tuning, reducing data size, and controlling execution plans. The results showed that all three missions were completed by applying PostgreSQL expertise and customizing it for the large-scale requirements of the project.
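7. Copyright © 2013 NTT DATA Corporation
Hadoop made large-scale data processing practical
- OSS that pursues "good enough," riding the trend in which conventional technology is overtaken by later, low-end technology with sufficient performance
- Make full use of mature technology at low cost
- Combine multiple OSS products, each where it fits best
- Use them while keeping quality under control
- OSS that takes on areas that were difficult for conventional technology
- Solve the problems specific to the new area
- Use it while keeping risk under control
- OSS makes the most of commodity products and brings advanced technology (large-volume data processing) within reach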
[Figure: performance over time, 1970 to 2010, annotated with the following laws]
- Moore's Law: The number of transistors on the chip doubles every 18 months.
- Gilder's Law: The bandwidth of network doubles every 6 months.
- Metcalfe's Law: The value of a network is proportional to the square of the number of users.
Ahead of Spark lies the road that Hadoop opened
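20.
Community growth that backs up the summit's strong turnout
Source: The State of Spark (Matei Zaharia)
Spark holds its own even against Storm and other projects that have been much discussed recently.
As the open-source community behind a project grows, feature development and bug fixing tend to improve.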
23.
The developer base is clearly growing steadily
Source: The State of Spark (Matei Zaharia)
Spark 0.9.0: 142 contributors (February 2014)
Trend in the number of GitHub "contributors"
24.
In fact, Spark had been appearing at conferences since 2012
Source: The State of Spark (Matei Zaharia)
Mainly from AMPLab, the group from which Databricks emerged
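25.
Tighter integration with Hadoop has made Spark more convenient
Source: Big Data Research in the AMPLab: BDAS and Beyond (Michael Franklin)
There is a visible move to strengthen integration between "existing UC Berkeley products" and "the existing Hadoop ecosystem," which makes it easier to use the capabilities of Hadoop, the earlier project.
- Used for Spark input/output and data persistence
- Used for advanced resource management
- Uses the SQL-based distributed processing cultivated in Hive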
26.
Used for personalization at Yahoo Taiwan
Some processing has been moved from Hadoop to Spark
Source: Hadoop and Spark Join Forces at Yahoo (Andy Feng)
Used on a shopping site in Taiwan
Enterprise use also appears to be growing little by little
(The migration reportedly took a few people a few months)
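30.
Making interactive and iterative processing easier to use
Source: The State of Spark (Matei Zaharia)
Conventional piecemeal processing: each step runs while storing intermediate results in HDFS
With Spark:
- Easy to use from existing languages
- HDFS can be used transparently
Illustration of the functionality Spark provides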
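The contrast on this slide, between chained jobs that park intermediate results in HDFS and a pipeline that keeps them in memory, can be sketched with a toy model in plain Python. Everything here is an assumption made for illustration: `CountingStore` is a hypothetical stand-in for HDFS that only counts accesses, and the two functions are not Spark or Hadoop APIs.

```python
class CountingStore:
    """Hypothetical stand-in for HDFS that counts reads and writes."""

    def __init__(self):
        self.data = {}
        self.reads = 0
        self.writes = 0

    def write(self, key, rows):
        self.writes += 1
        self.data[key] = list(rows)

    def read(self, key):
        self.reads += 1
        return list(self.data[key])


def run_piecemeal(store, steps):
    """Run each step as a separate job that reads from and writes to the store."""
    key = "input"
    for i, step in enumerate(steps):
        rows = store.read(key)
        key = f"stage-{i}"
        store.write(key, [step(row) for row in rows])
    return store.read(key)


def run_in_memory(store, steps):
    """Read the input once and keep every intermediate result in memory."""
    rows = store.read("input")
    for step in steps:
        rows = [step(row) for row in rows]
    return rows


if __name__ == "__main__":
    steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
    for runner in (run_piecemeal, run_in_memory):
        store = CountingStore()
        store.write("input", [1, 2, 3])
        result = runner(store, steps)
        print(runner.__name__, result, "reads:", store.reads, "writes:", store.writes)
```

Both runners produce the same result; the difference lies in how often the store is touched, which is the cost that grows with each additional iteration in the piecemeal style.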