Revisiting CephFS MDS and mClock QoS Scheduler - Yongseok Oh
The document presents an overview of CephFS and its metadata server (MDS) evaluation, focusing on aspects such as scalability, performance of kernel versus FUSE clients, and the impact of MDS cache sizes. It also discusses the mClock QoS scheduler and various configurations for testing, including recovery times and the effects of subtree pinning. Overall, it highlights findings on performance metrics and technical challenges within CephFS.
Tech Talk: RocksDB Slides by Dhruba Borthakur & Haobo Xu of Facebook - The Hive
The document discusses RocksDB, an embedded key-value store optimized for fast storage, emphasizing its architecture, performance improvements over previous databases like LevelDB, and features such as pluggable components and efficient compaction strategies. It highlights RocksDB's ability to substantially reduce write amplification, improve read efficiency with bloom filters, and support various storage backends. Additionally, it outlines practical use cases and optimizations that cater to specific application requirements, underscoring its suitability for server workloads.
The document summarizes new features and updates in Ceph's RBD block storage component. Key points include: improved live migration support using external data sources; built-in LUKS encryption; up to 3x better small I/O performance; a new persistent write-back cache; snapshot quiesce hooks; kernel messenger v2 and replica read support; and initial RBD support on Windows. Future work planned for Quincy includes encryption-formatted clones, cache improvements, usability enhancements, and expanded ecosystem integration.
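As a rough illustration of the encryption and live-migration features mentioned above, the commands might look like the following (a sketch assuming a Pacific-or-later rbd client; the pool and image names are hypothetical):
# Pool/image names are hypothetical; assumes a Pacific-or-later rbd client.
rbd encryption format mypool/myimage luks2 /root/passphrase.bin   # format an image with built-in LUKS2
rbd migration prepare mypool/myimage mypool/mynewimage            # start a live migration
rbd migration execute mypool/mynewimage                           # copy data in the background
rbd migration commit mypool/mynewimage                            # finalize once the copy completes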
The document discusses improving UDP transaction performance, focusing on techniques used in high-bandwidth environments, particularly with 10G networks. It covers key technologies like Receive Side Scaling (RSS) that enhance network processing by distributing packets among multiple CPU cores, and presents performance metrics from experiments using a multi-threaded echo server. Additionally, it provides insights on analyzing and fine-tuning system parameters to identify and resolve bottlenecks in UDP transaction processing.
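For context, RSS queue counts and IRQ placement are typically inspected and tuned along these lines (a sketch; the interface name, queue count, and IRQ number are illustrative):
# Interface name, queue count, and IRQ number are illustrative.
ethtool -l eth0                      # show current and maximum RX/TX channel (queue) counts
ethtool -L eth0 combined 8           # spread packet processing across 8 queues/cores
ethtool -x eth0                      # inspect the RSS indirection table
grep eth0 /proc/interrupts           # find the IRQ assigned to each queue
echo 4 > /proc/irq/45/smp_affinity   # pin one queue's IRQ to CPU 2 (bitmask 0x4)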
Big Data Business Wins: Real-time Inventory Tracking with Hadoop - DataWorks Summit
MetaScale is a subsidiary of Sears Holdings Corporation that provides big data technology solutions and services focused on Hadoop. It helped Sears implement a real-time inventory tracking system using Hadoop and Cassandra to create a single version of inventory data across different legacy systems. This allowed inventory levels to be updated in real-time from POS data, reducing out-of-stocks and improving the customer experience.
This talk discusses Linux profiling using perf_events (also called "perf") based on Netflix's use of it. It covers how to use perf to get CPU profiling working and overcome common issues. The speaker will give a tour of perf_events features and show how Netflix uses it to analyze performance across their massive Amazon EC2 Linux cloud. They rely on tools like perf for customer satisfaction, cost optimization, and developing open source tools like NetflixOSS. Key aspects covered include why profiling is needed, a crash course on perf, CPU profiling workflows, and common "gotchas" to address like missing stacks, symbols, or profiling certain languages and events.
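A minimal sketch of the CPU profiling workflow described (the sampling rate and duration are illustrative):
# Sampling rate and duration are illustrative.
perf record -F 99 -a -g -- sleep 30   # sample all CPUs at 99 Hz with call graphs for 30 seconds
perf report -n --stdio                # summarize the hottest code paths
perf script > out.perf                # dump raw samples, e.g. as input for flame graph tooling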
InfluxDB 101 - Concepts and Architecture | Michael DeSa | InfluxData
The document is a training presentation on InfluxDB and time-series data aimed at engineering managers. It covers concepts such as the InfluxDB data model, schema design, and querying data using various languages. Participants will learn how to manage and effectively utilize time-series data within the InfluxDB platform.
The document provides an overview of RocksDB, an open-source log-structured merge (LSM) database optimized for performance and reliability in backend services. It details key differences between LSM and B+ tree architectures, performance practices, and reliability strategies including handling write stalls and managing tombstones. The insights drawn from years of production experience highlight the trade-offs between write efficiency and read performance, along with best practices for database operations.
DigitalOcean uses Ceph for block and object storage backing for their cloud services. They operate 37 production Ceph clusters running Nautilus and one on Luminous, storing over 54 PB of data across 21,500 OSDs. They deploy and manage Ceph clusters using Ansible playbooks and containerized Ceph packages, and monitor cluster health using Prometheus and Grafana dashboards. Upgrades can be challenging due to potential issues uncovered and slow performance on HDD backends.
BlueStore, A New Storage Backend for Ceph, One Year In - Sage Weil
The document discusses BlueStore, a new Ceph OSD backend, contrasting it with the previous FileStore and highlighting its improved performance, reliability, and scalability. It details the underlying architecture, including key features such as the use of RocksDB for metadata, direct data writes to block devices, and a pluggable block allocator. Additionally, it addresses ongoing challenges and improvements that BlueStore aims to provide over FileStore, especially in the management of object storage and transaction processing.
Building an open data platform with Apache Iceberg - Alluxio, Inc.
The document outlines the development of an open data platform using Apache Iceberg, which serves as a table format for analytic data, providing transactional guarantees and performance enhancements. It highlights the need for a multi-engine architecture that leverages various tools like Spark, Trino, and Flink, while addressing usability and productivity issues within data management. Key goals of Iceberg include improving transactions, performance, and usability through features like schema evolution and reliable updates.
The document provides an overview of Ceph, a distributed storage system that supports object, block, and file storage in a single scalable cluster. It discusses the CRUSH (Controlled Replication Under Scalable Hashing) algorithm used for data placement and replication, highlighting key components such as OSDs, monitors, and the CRUSH hierarchy and rules. Additionally, it covers the flexibility and performance improvements achieved through effective data management, recovery processes, and the handling of failures within the cluster.
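A few commands that illustrate how the CRUSH hierarchy, rules, and resulting placements can be inspected (a sketch; the pool, object, and rule names are assumptions):
# Pool, object, and rule names are assumptions.
ceph osd tree                          # show the CRUSH hierarchy of root/host/osd buckets
ceph osd crush rule ls                 # list the CRUSH rules
ceph osd crush rule dump replicated_rule
ceph osd map mypool myobject           # show the PG and OSDs that CRUSH maps an object to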
Testing Persistent Storage Performance in Kubernetes with Sherlock - ScyllaDB
The document presents an overview of storage performance testing in Kubernetes, focusing on the Sherlock tool developed by Sagy Volkov from Lightbits Labs. Sherlock supports multiple databases and workloads, facilitating real-world performance evaluations beyond standard I/O tests. It employs nvme/tcp for efficient performance and is designed to streamline the benchmarking process for new Kubernetes users.
Linux Block Cache Practice on Ceph BlueStore - Junxin Zhang, Ceph Community
This document discusses using Linux block caching with Ceph BlueStore. It explains that BlueStore can better utilize fast storage devices like SSDs compared to FileStore. It tested using Bcache and DM-writeboost to cache BlueStore data on HDDs using SSDs. Bcache performed better overall. Issues found were slow requests when caching and BlueStore used the same SSD, and inconsistency in SSD data management between BlueStore and the cache. Future work could have BlueStore control all raw disks and prioritize data saving to fast devices.
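A hedged sketch of the kind of bcache setup the test describes, fronting an HDD-backed BlueStore data device with an SSD (device paths and the cache-set UUID placeholder are hypothetical; the DM-writeboost variant is not shown):
# Device paths are hypothetical; <cache-set-uuid> is reported by make-bcache.
make-bcache -C /dev/nvme0n1                               # format the SSD as a cache device
make-bcache -B /dev/sdb                                   # format the HDD as a backing device
echo /dev/nvme0n1 > /sys/fs/bcache/register
echo /dev/sdb > /sys/fs/bcache/register
echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach  # attach the backing device to the cache set
echo writeback > /sys/block/bcache0/bcache/cache_mode     # enable write-back caching
# /dev/bcache0 is then used as the OSD data device for BlueStore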
Ceph Day Beijing - Ceph All-Flash Array Design Based on NUMA Architecture - Danielle Womboldt
This document discusses an all-flash Ceph array design from QCT based on NUMA architecture. It provides an agenda that covers all-flash Ceph and use cases, QCT's all-flash Ceph solution for IOPS, an overview of QCT's lab environment and detailed architecture, and the importance of NUMA. It also includes sections on why all-flash storage is used, different all-flash Ceph use cases, QCT's IOPS-optimized all-flash Ceph solution, benefits of using NVMe storage, QCT's lab test environment, Ceph tuning recommendations, and benefits of using multi-partitioned NVMe SSDs for Ceph OSDs.
YugaByte DB Internals - Storage Engine and Transactions - Yugabyte
This document introduces YugaByte DB, a high-performance, distributed, transactional database. It is built to scale horizontally on commodity servers across data centers for mission-critical applications. YugaByte DB uses a transactional document store based on RocksDB, Raft-based replication for resilience, and automatic sharding and rebalancing. It supports ACID transactions across documents, provides APIs compatible with Cassandra and Redis, and is open source. The architecture is designed for high performance, strong consistency, and cloud-native deployment.
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had... - Simplilearn
The document provides a comprehensive overview of Hive, a data warehouse system designed for querying and analyzing large datasets stored in HDFS using a SQL-like language called HiveQL. It covers the history of Hive, its architecture, data modeling, data types, operational modes, and key differences between Hive and traditional RDBMS systems. Additionally, it highlights various features of Hive that facilitate data processing and analysis.
(Slides) Task scheduling algorithm for multicore processor system for minimiz... - Naoki Shibata
The document presents a task scheduling algorithm designed for multicore processors to minimize recovery time in case of a single node failure. It introduces a new scheduling method that accounts for network contention and multicore processor failure, aiming to improve upon existing methods that overlook these factors. Testing results show that the proposed method significantly reduces execution time during failures by optimizing task distribution across processor dies.
GLOA: A New Job Scheduling Algorithm for Grid Computing - LINE+
The document presents GLOA, a new job scheduling algorithm for grid computing aimed at optimizing resource allocation by minimizing computation time and makespans. It discusses the problem space, simulation results, and the algorithm's approach, which involves leveraging social dynamics within groups to escape local minima. The conclusion emphasizes GLOA's potential real-world applications and reduced overhead on resources.
This document serves as an introduction to YARN and MapReduce 2, highlighting the course objectives targeted towards developers, data analysts, and system administrators. It explains the differences between MapReduce 1 and 2, the architecture of YARN, resource management, and how to manage a YARN cluster. Additionally, it covers the components involved in running applications on YARN, along with fault tolerance mechanisms and the integration of various applications within the YARN framework.
The document outlines topics covered in "The Impala Cookbook" published by Cloudera. It discusses physical and schema design best practices for Impala, including recommendations for data types, partition design, file formats, and block size. It also covers estimating and managing Impala's memory usage, and how to identify the cause when queries exceed memory limits.
Ceph is a distributed storage system that supports block, object, and file storage, founded by Sage Weil. It features high scalability, reliability, and performance, utilizing components like ceph-mon, ceph-mgr, and ceph-osd. The document outlines Ceph's architecture, usage, and components, providing insights into its operation and comparison with other storage systems.
Pegasus: Designing a Distributed Key Value System (Arch Summit Beijing 2016) - 涛 吴
Pegasus is a high-performance, highly available, and strongly consistent distributed KV storage system developed by Xiaomi, addressing the limitations of existing systems like HBase. The design choices focus on using C++ for better performance, a shared commit log for improved data consistency, and features like automatic failover and flexible data modeling. Pegasus aims to ensure high availability, optimized performance, and an easy-to-use interface while supporting extensive scalability for massive workloads.
23. CephFS - Jewel
• Single Active MDS, Active-Standby MDSs
• Single CephFS within a single Ceph Cluster
• CephFS requires at least kernel 3.10.x
• CephFS – Production Ready
• Experimental Features (example commands for enabling them follow below)
  • Multi Active MDSs
  • Multiple CephFS file systems within a single Ceph Cluster
  • Directory Fragmentation
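A hedged sketch of how these experimental features were typically switched on in the Jewel/Kraken era; the file system name "cephfs" is an assumption, and the exact flag names changed in later releases:
# Assumes an fs named "cephfs"; flags reflect the Jewel/Kraken era and differ in later releases.
# Some of these require an extra --yes-i-really-mean-it confirmation.
ceph fs set cephfs allow_multimds true     # allow more than one active MDS (experimental)
ceph fs set cephfs max_mds 2               # run two active MDS ranks
ceph fs flag set enable_multiple true      # allow multiple CephFS file systems in one cluster
ceph fs set cephfs allow_dirfrags true     # enable directory fragmentation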
38. CephFS Test Analysis - Stability Testing
• Read/write data workload
• Tool of choice: fio
# fio loop test of reads and writes (fio options and the /mnt/cephfs mount point are
# illustrative; the slide gives only the outline)
RUNTIME=${RUNTIME:-86400}              # test duration in seconds
while [ $SECONDS -lt $RUNTIME ]; do
    fio --name=seqwrite --rw=write --bs=1M --size=10G --filename=/mnt/cephfs/testfile
    fio --name=seqread --rw=read --bs=1M --size=10G --filename=/mnt/cephfs/testfile
    rm -f /mnt/cephfs/testfile
done
• Metadata read/write workload
• A home-grown script creates directories and files at large scale and writes a small amount of data to each file
# on the order of millions of files (loop body and paths are illustrative; the slide gives only the outline)
while [ $SECONDS -lt $RUNTIME ]; do
    for d in $(seq 1 1000); do
        mkdir -p /mnt/cephfs/dir_$d
        for f in $(seq 1 1000); do
            echo x > /mnt/cephfs/dir_$d/file_$f    # create each file and write a little data
        done
    done
    rm -rf /mnt/cephfs/dir_*                       # delete the files and directories
done
39. CephFS Test Analysis - Stability Testing
• Conclusions
  • Over several days of continuous testing, CephFS behaved normally
  • Tests with hundreds of millions of small files exposed some problems
• Problems and solutions
  • "Behind on trimming" warnings in the log
    Tune mds_log_max_expiring and mds_log_max_segments
  • "No space left on device" errors when rm-ing hundreds of millions of files
    Increase mds_bal_fragment_size_max, mds_max_purge_files, and mds_max_purge_ops_per_pg
  • "_send skipping beacon, heartbeat map not healthy" messages in the log
    Increase mds_beacon_grace, mds_session_timeout, and mds_reconnect_timeout
Workflow: MDS log messages -> search the related Ceph code -> analyze the cause -> adjust parameters (see the example below)
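A hedged sketch of how the parameters named above might be raised at runtime; the daemon name "mds.a" and the values shown are illustrative, not the ones used in the test:
# Daemon name "mds.a" and the values are illustrative, not those used in the test.
ceph daemon mds.a config set mds_log_max_segments 240        # runtime change via the admin socket
ceph daemon mds.a config set mds_bal_fragment_size_max 500000
ceph daemon mds.a config set mds_beacon_grace 60
ceph tell mds.a injectargs '--mds_session_timeout=120 --mds_reconnect_timeout=120'
# Persist the same keys in the [mds] section of ceph.conf so they survive daemon restarts.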
45. Outlook – Ceph Luminous
• Ceph Luminous (v12.2.0) - next long-term stable release series
1. The new BlueStore backend for ceph-osd is now stable and the new default for newly created OSDs
2. Multiple active MDS daemons are now considered stable
3. CephFS directory fragmentation is now stable and enabled by default
4. Directory subtrees can be explicitly pinned to specific MDS daemons (example commands below)
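A hedged sketch of how the multi-active MDS and subtree-pinning features in items 2 and 4 are used; the file system name, mount point, and rank numbers are assumptions:
# fs name "cephfs", the /mnt/cephfs mount point, and the directory names are assumptions.
ceph fs set cephfs max_mds 2                        # promote a second MDS rank to active
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/proj_a    # pin this subtree to MDS rank 1
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/proj_b    # pin this subtree to MDS rank 0
getfattr -n ceph.dir.pin /mnt/cephfs/proj_a         # verify the pin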