Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013) by Adam Kawa
Adam Kawa shares his experiences working with a large, rapidly growing Hadoop cluster at Spotify. He details five "adventures" where various problems broke the cluster or made it unstable. These included issues with user permissions causing NameNode instability, DataNodes becoming blocked in deadlocks, Hive jobs being killed by the Fair Scheduler, and the JobTracker becoming slow due to overly large jobs. Each time, the problems were troubleshot and lessons were learned about proper cluster management, testing changes, and making data-driven decisions.
How Apache Drives Music Recommendations At Spotify by Josh Baer
The document discusses how Spotify utilizes Apache technologies, including Kafka, Hadoop, and Cassandra, to drive music recommendations for over 75 million users. It highlights the infrastructure's capability to handle large volumes of data and the various tools used for processing and personalizing playlists. The presentation emphasizes Spotify's commitment to data-driven music discovery and the collaborative open-source community surrounding Apache products.
The document outlines the evolution of big data at Spotify, focusing on their technical infrastructure and data processing methods. Highlights include the transition from a log archiver to Apache Kafka for improved data ingestion, and the move to Crunch for better performance and reliability in processing. Spotify's growth in user base and data processing capabilities has led to expanded use cases, including machine learning and advanced analytics.
Adam Kawa, a data engineer at Spotify, discusses the intricacies of Hadoop operations, specifically focusing on data analysis and infrastructure utilized by Spotify. The document covers numerous technical challenges, optimizations, and lessons learned from analyzing data metrics, enhancing performance in Hadoop, and addressing capacity planning. Various strategies are presented for improving data utilization and retention policies based on user behavior and operational metrics.
Danielle Jabin is a data engineer at Spotify who works on A/B testing infrastructure. She describes Spotify's big data landscape, which includes over 40 million active users generating 1.5 TB of compressed data per day. Spotify collects this user data using Kafka for high-volume data collection, processes it using Hadoop on a large cluster, and stores aggregates in databases like PostgreSQL and Cassandra for analytics and visualization.
This document provides an overview of Spotify's backend infrastructure, detailing its services, data management, and development processes. It highlights the use of various programming languages and technologies, including Python, C++, and Java, while addressing challenges such as data consistency and user-generated content. The document also touches on Spotify's approach to caching, P2P streaming, and the integration of Hadoop for analytics.
1) At Spotify, big data is used to answer important questions from various stakeholders, such as how many times songs have been streamed, which artists are most popular, and streaming numbers for marketing purposes.
2) Data infrastructure at Spotify includes a large Hadoop cluster with over 6 petabytes of data used to generate insights from user activity logs and improve the product.
3) Answering tricky questions requires techniques like A/B testing and analyzing streaming patterns to determine viral songs or artist reactions to new releases. Data-driven decisions are made to personalize the user experience.
Hadoop Summit Europe 2014: Apache Storm Architecture by P. Taylor Goetz
Storm is an open-source distributed real-time computation system. It uses a distributed messaging system to reliably process streams of data. The core abstractions in Storm are spouts, which are sources of streams, and bolts, which are basic processing elements. Spouts and bolts are organized into topologies, which represent the flow of data. Storm provides fault tolerance through message acknowledgments, which give at-least-once processing guarantees. Trident, a high-level abstraction built on Storm, adds exactly-once processing semantics and supports operations like aggregations, joins, and state management through its micro-batch-oriented API.
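The spout/bolt/topology model described above can be illustrated with a toy sketch. This is not Storm's actual API (real topologies use Java classes such as `TopologyBuilder` from `org.apache.storm`); the `Spout`, `Bolt`, and `run_topology` names below are invented for illustration only, to show how tuples flow from a stream source through processing elements and get acknowledged:

```python
class Spout:
    """Toy stand-in for a Storm spout: a source that emits tuples one at a time."""
    def __init__(self, items):
        self.items = list(items)

    def next_tuple(self):
        # Return the next tuple, or None when the stream is exhausted.
        return self.items.pop(0) if self.items else None


class Bolt:
    """Toy stand-in for a Storm bolt: processes a tuple and may emit a new one."""
    def __init__(self, fn):
        self.fn = fn
        self.acked = 0

    def execute(self, tup):
        result = self.fn(tup)
        self.acked += 1  # acknowledge, so the source can retire the tuple
        return result


def run_topology(spout, bolts):
    """Drive tuples from the spout through a chain of bolts (a linear topology)."""
    results = []
    while (tup := spout.next_tuple()) is not None:
        for bolt in bolts:
            tup = bolt.execute(tup)
        results.append(tup)
    return results


# Word-count-style usage: one bolt counts words per sentence.
spout = Spout(["storm processes streams", "bolts transform tuples"])
counter = Bolt(lambda s: len(s.split()))
print(run_topology(spout, [counter]))
```

In real Storm, acknowledgment flows back through an acker task so a failed tuple can be replayed from the spout; the toy above only counts acks to hint at that mechanism.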