CEPH RDMA UPDATE
XSKY Haomai Wang
2017.06.06
Ceph Network Evolution
• The History of Messenger
– SimpleMessenger
– XioMessenger
– AsyncMessenger
Ceph Network Evolution
• AsyncMessenger
– Core library included by all components
– Kernel TCP/IP driver
– Epoll/kqueue event-driven
– Maintains connection lifecycle and sessions
– Replaces the aging SimpleMessenger
– Fixed-size thread pool (vs. 2 threads per socket)
– Scales better to larger clusters
– Healthier interaction with tcmalloc
– Now the default! (config sketch below)
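Since AsyncMessenger is already the default, no configuration is
normally needed; a minimal ceph.conf sketch to pin it explicitly
(option names as of Luminous-era releases) might look like:

    [global]
    # Select the async messenger with the kernel TCP/IP
    # (posix) transport; this is the default anyway.
    ms_type = async+posix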
Ceph Network Evolution
• Performance bottlenecks:
– Non-local processing of connections
• RX runs in interrupt context
• Application and system calls run in another context
– Global TCP control block management
– VFS overhead
– TCP is optimized for:
• Throughput, not latency
• Long-haul networks (high latency)
• Congestion throughout the network
• A modest number of connections per server
Ceph Network Evolution
• Hardware assistance
– SolarFlare (TCP offload)
– RDMA (InfiniBand/RoCE)
– GAMMA (Genoa Active Message Machine)
• Data plane
– DPDK + userspace TCP/IP stack
• Linux kernel improvements
• TCP or non-TCP?
• Pros:
– Compatible
– Proven
• Cons:
– Complexity
• Notes:
– Aim for lower latency and better scalability, but no need to push to extremes
Ceph Network Evolution
• Built for high performance
– DPDK
– SPDK
– Full userspace IO path
– Shared-nothing TCP/IP stack (modeled on Seastar; config sketch below)
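A hedged sketch of enabling the DPDK-based userspace stack in
ceph.conf; the ms_dpdk_* option names follow the upstream
async+dpdk backend, but exact names, defaults, and the example
addresses should be verified against your Ceph build and network:

    [global]
    ms_type = async+dpdk
    # CPU cores dedicated to the DPDK poll-mode threads
    ms_dpdk_coremask = 0x3
    # The userspace stack bypasses the kernel, so it needs
    # its own L3 configuration
    ms_dpdk_host_ipv4_addr = 192.168.1.10
    ms_dpdk_gateway_ipv4_addr = 192.168.1.1
    ms_dpdk_netmask_ipv4_addr = 255.255.255.0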
Ceph Network Evolution
• Problems
– OSD design
• Each OSD owns one disk
• Pipeline model
• Too much locking/waiting in the legacy code
– DPDK + SPDK
• Must run on NVMe SSD
• CPU spinning (poll-mode threads burn cores)
• Limited use cases
CEPH RDMA Support
• RDMA backend
– Inherits NetworkStack and implements RDMAStack
– Uses userspace verbs directly
– TCP as the control path
– Exchanges messages using RDMA SEND
– Uses a shared receive queue (SRQ)
– Multiple connection QPs in a many-to-many topology
– Built into the Ceph master branch
– All features are fully available on Ceph master
• Supported:
– RHEL/CentOS
– InfiniBand and Ethernet
– RoCE v2 for crossing subnets
– Front-end TCP and back-end RDMA (config sketch below)
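A hedged ceph.conf sketch for the RDMA backend. The
ms_async_rdma_* option names follow the upstream RDMAStack code
of this era, and the device name (mlx5_0) is an assumption for a
Mellanox NIC; check ibv_devices on your host. The per-network
split at the end relies on the ms_public_type/ms_cluster_type
options, which may not exist in every release:

    [global]
    ms_type = async+rdma
    # RDMA device to bind to (assumed name; list devices
    # with the ibv_devices tool)
    ms_async_rdma_device_name = mlx5_0

    # Hypothetical mixed deployment: clients over TCP on the
    # public network, replication over RDMA on the cluster
    # network
    # ms_public_type = async+posix
    # ms_cluster_type = async+rdma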
RDMA Support
Network engine            Default  NIC requirement    Performance  Network compatibility  OSD store requirement  OSD media requirement
Posix (kernel TCP/IP)     Yes      None               Medium       Any TCP/IP network     None                   None
DPDK + userspace TCP/IP   No       DPDK-capable NIC   High         Any TCP/IP network     BlueStore required     NVMe SSD required
RDMA                      No       RDMA-capable NIC   High         RDMA-capable network   None                   None
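To confirm which messenger a running daemon actually uses, the
standard admin-socket query works (osd.0 is a placeholder daemon
name):

    ceph daemon osd.0 config get ms_type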
RDMA Support
• RDMA-VERBS
– Native RDMA support
– Exchanges connection information via TCP/IP
• RDMA-CM:
– Provides a simpler abstraction over verbs
– Required by iWARP (enable sketch below)
– Verbs functionality is carried forward
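Support for an RDMA-CM control path was still in flight at this
point; in builds that carry it, enabling it is sketched below
(the ms_async_rdma_cm option name is taken from the upstream
work in progress and may differ in your release):

    [global]
    ms_type = async+rdma
    # Use librdmacm for connection management instead of
    # exchanging QP information over a TCP control path
    ms_async_rdma_cm = true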
RDMA Support
• Usages
– QEMU/KVM
– NBD
– FUSE
– S3/Swift object storage
– The whole Ceph ecosystem
RDMA Support
• Work in progress:
– RDMA-CM for the control path
• Support multiple devices
• Enable a unified ceph.conf across all Ceph nodes
– Zero-copy Ceph replication
• Halves the number of memcpys by re-using data buffers on the primary OSD
– Tx zero-copy
• Avoid the copy-out by sending from registered memory
• To do:
– Use RDMA READ/WRITE for better memory utilization
– ODP (on-demand paging)
– Erasure coding using hardware offload
– Performance is not yet sufficient
WeChat Official Accounts
For the latest XSKY news, products, enterprise solutions, and
online events, follow the company's official WeChat account:
XSKY 微信公众号 (the XSKY official account)
Haomai's weekly digest of Ceph community development progress,
aimed at the Ceph community and open-source enthusiasts, with an
emphasis on engineering and project direction:
Ceph开发每周谈 (Ceph Development Weekly)
Tailored to enterprise storage solutions and drawing on Uncle
Fu's (福叔) years of experience in data storage and management;
recommended for industry and enterprise storage operators:
福叔讲存储 (Uncle Fu on Storage)
Thank you

More Related Content

Ceph Day Beijing - Ceph RDMA Update