7. Data Architecture Design: Cache
- Design & Development:
  - Avoid having multiple domains each keep their own cache of the same data: stale/inconsistent data
  - Multiple write entry points modifying the same data: stale/inconsistent data
  - Frequent calls for large cached objects: bandwidth bottleneck; monitor and pinpoint them
  - Caching hot keys correctly / concurrent refresh and fallback to the source (see the reload sketch after this list)
  - Handling a cache server going down (see the fail-open sketch after this list):
    - By design, a cache is just a cache; treat an outage like a cache miss and the function should still be OK
  - Middleware thread/load bottlenecks and connection-pool timeout handling
  - Control per-instance load/traffic/QPS
  - Backend database capacity / second-tier cache
- Deployment & Operation:
  - Cross-IDC cache calls and updates: bandwidth and latency
  - Bandwidth contention between cache instances: locate and isolate it quickly
  - A 10-gigabit network is king
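The stale-data bullets and the cache-server-down bullet above come down to one pattern: route every write for a piece of data through a single entry point (update the database, then invalidate), and on the read path treat any cache failure exactly like a miss and fall back to the database. Below is a minimal cache-aside sketch of that idea, with the cache and database stubbed by in-memory maps; `ProductStore` and its methods are illustrative names, not from the slides.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative cache-aside store: one write entry point, cache failures treated as misses. */
public class ProductStore {
    private final Map<String, String> db = new ConcurrentHashMap<>();      // stand-in for the real database
    private final Map<String, String> cache = new ConcurrentHashMap<>();   // stand-in for Redis/Memcached
    volatile boolean cacheDown = false;                                     // simulate "cache server down"

    /** Read path: try the cache, treat any cache error as a miss, fall back to the DB, repopulate. */
    public Optional<String> get(String key) {
        try {
            if (cacheDown) throw new IllegalStateException("cache unreachable");
            String hit = cache.get(key);
            if (hit != null) return Optional.of(hit);
        } catch (RuntimeException e) {
            // by design, cache is just cache: log and continue as if this were a cache miss
        }
        String fromDb = db.get(key);
        if (fromDb != null && !cacheDown) cache.put(key, fromDb);           // best-effort repopulate
        return Optional.ofNullable(fromDb);
    }

    /** Single write entry point: update the DB first, then invalidate, so no one keeps a stale copy. */
    public void put(String key, String value) {
        db.put(key, value);
        if (!cacheDown) cache.remove(key);                                  // invalidate rather than double-write the value
    }

    public static void main(String[] args) {
        ProductStore store = new ProductStore();
        store.put("sku-1", "price=99");
        System.out.println(store.get("sku-1"));   // served from the DB, then cached
        store.cacheDown = true;                   // cache outage: reads still work, only slower
        System.out.println(store.get("sku-1"));
    }
}
```

Invalidate-on-write only prevents stale copies if every domain goes through this one entry point instead of keeping its own cache of the same data, which is exactly the point of the first two bullets.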
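For the hot-key bullet, the usual failure is a stampede: when a hot entry expires or is missing, every concurrent request goes back to the source at once. A common guard is to let only one in-flight load per key reach the backing store while the other callers wait for its result. A sketch of that per-key "singleflight", assuming a caller-supplied `loadFromSource` function; all names are illustrative.

```java
import java.util.concurrent.*;
import java.util.function.Function;

/** Illustrative per-key "singleflight": only one concurrent reload per hot key. */
public class HotKeyLoader {
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
    private final ConcurrentMap<String, CompletableFuture<String>> inFlight = new ConcurrentHashMap<>();
    private final Function<String, String> loadFromSource;   // e.g. a database query (assumption)

    public HotKeyLoader(Function<String, String> loadFromSource) {
        this.loadFromSource = loadFromSource;
    }

    public String get(String key) {
        String hit = cache.get(key);
        if (hit != null) return hit;
        // Only the first caller for a key creates the future; everyone else waits on the same one.
        CompletableFuture<String> f = inFlight.computeIfAbsent(key, k ->
                CompletableFuture.supplyAsync(() -> loadFromSource.apply(k)));
        try {
            String value = f.join();
            cache.put(key, value);
            return value;
        } finally {
            inFlight.remove(key, f);   // allow future refreshes once this load completes
        }
    }

    public static void main(String[] args) throws Exception {
        HotKeyLoader loader = new HotKeyLoader(k -> {
            System.out.println("loading " + k + " from source");   // typically printed once for all 8 calls
            return "value-for-" + k;
        });
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++) pool.submit(() -> loader.get("hot-key"));
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```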
14. Scalability - Scale Out - Splitting
- Always split first, then shard
- Make sure you know the difference between splitting (split) and partitioning (shard)
- Wait:
  - Do you know how much a modern system can handle?
  - Do you know your system's TPS/QPS?
  - How optimized is your code?
  - Always measure first (a measurement sketch follows this list)
- Splitting
  - What to split:
    - Start with the module that consumes the most resources and is relatively the most independent
    - Do you know which module in your system consumes the most resources, and what percentage each one accounts for?
  - How to split: dual-write, or API-first? (a dual-write sketch follows this list)
  - How to verify:
  - The vip path: shopping cart, inventory, orders, users, products, ...
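Before splitting anything, the "wait" bullets above ask for numbers: the TPS/QPS and latency of the current system. If no metrics pipeline exists yet, even a throwaway counter around the hot path answers the question; a minimal sketch, where `measure(...)` wraps a placeholder request handler that stands in for real work.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative "measure first": count requests and latency, then report QPS per window. */
public class QpsMeter {
    private final AtomicLong count = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();

    public <T> T measure(java.util.function.Supplier<T> requestHandler) {
        long start = System.nanoTime();
        try {
            return requestHandler.get();
        } finally {
            count.incrementAndGet();
            totalNanos.addAndGet(System.nanoTime() - start);
        }
    }

    /** Prints and resets the counters; call once per second from a scheduler for live QPS. */
    public void reportAndReset() {
        long n = count.getAndSet(0);
        long nanos = totalNanos.getAndSet(0);
        double avgMs = n == 0 ? 0 : nanos / 1_000_000.0 / n;
        System.out.printf("qps=%d avg_latency_ms=%.2f%n", n, avgMs);
    }

    public static void main(String[] args) throws InterruptedException {
        QpsMeter meter = new QpsMeter();
        long end = System.currentTimeMillis() + 1_000;
        while (System.currentTimeMillis() < end) {
            meter.measure(() -> "ok");            // placeholder for the real request handler
            Thread.sleep(1);                      // pretend each request takes ~1 ms
        }
        meter.reportAndReset();                   // after ~1 s the printed count is the toy loop's QPS
    }
}
```

In practice the same numbers would come from existing monitoring (access logs, JMX, a metrics library); the point of the bullet is to have them before deciding what to split.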
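"Dual-write, or API-first?" is the classic extraction question. One way the dual-write side can look while, say, the order module is being pulled out of a monolith: writes go to both the legacy store and the new service, reads stay on the legacy side, and every read compares the two so the "how to verify" step runs continuously. A sketch under those assumptions; `OrderMigrationFacade` and both stores are stand-ins, not the deck's design.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative dual-write while extracting an "order" module out of a monolith. */
public class OrderMigrationFacade {
    private final Map<String, String> legacyStore = new ConcurrentHashMap<>();      // old monolith tables (stand-in)
    private final Map<String, String> newOrderService = new ConcurrentHashMap<>();  // new order service (stand-in)
    private volatile boolean readFromNewService = false;    // flipped only after verification

    /** Writes go to both sides; the legacy write remains the source of truth. */
    public void saveOrder(String orderId, String payload) {
        legacyStore.put(orderId, payload);
        try {
            newOrderService.put(orderId, payload);           // best effort; failures are logged, not fatal
        } catch (RuntimeException e) {
            System.err.println("dual-write to new service failed for " + orderId + ": " + e);
        }
    }

    /** Reads come from the legacy store until comparisons show the new service is consistent. */
    public String getOrder(String orderId) {
        String legacy = legacyStore.get(orderId);
        String fresh = newOrderService.get(orderId);
        if (fresh != null && !fresh.equals(legacy)) {
            System.err.println("mismatch for " + orderId);   // feeds the "how to verify" step
        }
        return readFromNewService ? fresh : legacy;
    }

    public static void main(String[] args) {
        OrderMigrationFacade orders = new OrderMigrationFacade();
        orders.saveOrder("o-1", "cart=3 items");
        System.out.println(orders.getOrder("o-1"));          // served from the legacy side for now
    }
}
```

The cut-over is then just flipping `readFromNewService` once mismatches stop appearing; the API-first alternative simply puts this facade behind a service interface before the dual-write starts.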