Ceph's network stack is evolving to improve performance: the messenger layer has moved from SimpleMessenger to AsyncMessenger, and an RDMA backend now provides lower latency and better scalability. Native RDMA support, using verbs or RDMA-CM, is built into Ceph and lets it run over InfiniBand or RoCE networks. Work continues to fully exploit RDMA for features such as zero-copy replication and erasure-coding offload.
3. • The History of Messenger
– SimpleMessenger
– XioMessenger
– AsyncMessenger
4. Ceph Network Evolution
• AsyncMessenger
– Core library included by all components
– Kernel TCP/IP driver
– Epoll/kqueue event-driven
– Maintains connection lifecycle and sessions
– Replaces the aging SimpleMessenger
– Fixed-size thread pool (vs. 2 threads per socket)
– Scales better to large clusters
– Healthier relationship with tcmalloc
– Now the default!
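As a quick illustration (not from the slides), the messenger type and its worker pool are chosen in ceph.conf. This is a minimal sketch assuming a recent Ceph release, where async+posix (AsyncMessenger over the kernel TCP stack) is already the default and ms_async_op_threads sizes the fixed worker pool:

    [global]
    # AsyncMessenger over the kernel TCP/IP stack (the default in recent releases)
    ms_type = async+posix
    # size of the fixed thread pool driving the epoll/kqueue event loops
    ms_async_op_threads = 3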
5. Ceph Network Evolution
• Performance bottlenecks:
– Non-local processing of connections
◦ RX handled in interrupt context
◦ Application and system calls run in another context
– Global TCP control block management
– VFS overhead
– TCP protocol optimized for:
◦ Throughput, not latency
◦ Long-haul networks (high latency)
◦ Congestion throughout the network
◦ A modest number of connections per server
6. Ceph Network Evolution
• Hardware assistance
– SolarFlare (TCP offload)
– RDMA (InfiniBand/RoCE)
– GAMMA (Genoa Active Message Machine)
• Data plane
– DPDK + user-space TCP/IP stack
• Linux kernel improvements
• TCP or non-TCP?
• Pros:
– Compatible
– Proven
• Cons:
– Complexity
• Notes:
– Aim for lower latency and better scalability, but no need to push to extremes
7. Ceph Network Evolution
• Built for high performance
– DPDK
– SPDK
– Full user-space I/O path
– Shared-nothing TCP/IP stack (modeled on Seastar)
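A minimal ceph.conf sketch for the DPDK-based user-space stack, assuming a Ceph build compiled with DPDK support; the ms_dpdk_* option names and addresses below are illustrative and may differ between releases:

    [global]
    # requires a Ceph build with DPDK enabled
    ms_type = async+dpdk
    # cores dedicated to the user-space network stack (busy polling)
    ms_dpdk_coremask = 0x3
    # the user-space TCP/IP stack bypasses the kernel, so it needs its own addressing
    ms_dpdk_host_ipv4_addr = 192.168.1.10
    ms_dpdk_netmask_ipv4_addr = 255.255.255.0
    ms_dpdk_gateway_ipv4_addr = 192.168.1.1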
8. Ceph Network Evolution
• Problems
– OSD design
◦ Each OSD owns one disk
◦ Pipeline model
◦ Too much locking/waiting in the legacy code path
– DPDK + SPDK
◦ Must run on NVMe SSDs
◦ CPU spinning (busy polling)
◦ Limited use cases
10. • RDMA backend
– Inherits NetworkStack to implement RDMAStack
– Uses user-space verbs directly
– TCP as the control path
– Exchanges messages using RDMA SEND
– Uses a shared receive queue (SRQ)
– Multiple connection QPs in a many-to-many topology
– Built into Ceph master
– All features are fully available on Ceph master
• Support:
– RHEL/CentOS
– InfiniBand and Ethernet
– RoCE v2 for cross-subnet traffic
– Front-end TCP and back-end RDMA (public network over TCP, cluster network over RDMA)
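A minimal ceph.conf sketch of how this backend might be enabled, assuming a Ceph build with RDMA support; the option names (ms_type, ms_cluster_type, ms_async_rdma_device_name) come from the async RDMA messenger family and can vary by release:

    [global]
    # all messengers over RDMA ...
    ms_type = async+rdma
    # ... or keep the public (front-end) network on TCP and move only the
    # cluster (back-end) network to RDMA:
    # ms_type = async+posix
    # ms_cluster_type = async+rdma
    # RDMA-capable device to use (InfiniBand or RoCE)
    ms_async_rdma_device_name = mlx5_0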
12. • RDMA verbs:
– Native RDMA support
– Connection information exchanged via TCP/IP
• RDMA-CM:
– Provides a simpler abstraction over verbs
– Required by iWARP
– Functionality is carried forward
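To show why RDMA-CM is the simpler abstraction, the sketch below is generic librdmacm/libibverbs client code (not taken from Ceph, error handling and cleanup omitted for brevity, addresses and sizes are placeholders): the CM event channel resolves the address and route and brings the queue pair up, instead of the application hand-exchanging QP numbers, GIDs and keys over a separate TCP connection.

#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <rdma/rdma_cma.h>

/* Block for the next CM event and check it is the one we expect. */
static int wait_event(struct rdma_event_channel *ec, enum rdma_cm_event_type expected)
{
    struct rdma_cm_event *ev;
    if (rdma_get_cm_event(ec, &ev))
        return -1;
    int ok = (ev->event == expected);
    rdma_ack_cm_event(ev);
    return ok ? 0 : -1;
}

int main(void)
{
    struct rdma_event_channel *ec = rdma_create_event_channel();
    struct rdma_cm_id *id;
    struct addrinfo *ai;

    rdma_create_id(ec, &id, NULL, RDMA_PS_TCP);

    /* Placeholder peer; replace with the real server address and port. */
    getaddrinfo("192.168.1.20", "7800", NULL, &ai);

    /* CM resolves addressing and routing (works for IB, RoCE and iWARP). */
    rdma_resolve_addr(id, NULL, ai->ai_addr, 2000);
    wait_event(ec, RDMA_CM_EVENT_ADDR_RESOLVED);
    rdma_resolve_route(id, 2000);
    wait_event(ec, RDMA_CM_EVENT_ROUTE_RESOLVED);

    /* Create verbs resources on the device the CM picked for this route. */
    struct ibv_pd *pd = ibv_alloc_pd(id->verbs);
    struct ibv_cq *cq = ibv_create_cq(id->verbs, 128, NULL, NULL, 0);
    struct ibv_qp_init_attr qpa;
    memset(&qpa, 0, sizeof(qpa));
    qpa.send_cq = cq;
    qpa.recv_cq = cq;
    qpa.qp_type = IBV_QPT_RC;
    qpa.cap.max_send_wr = 64;
    qpa.cap.max_recv_wr = 64;
    qpa.cap.max_send_sge = 1;
    qpa.cap.max_recv_sge = 1;
    rdma_create_qp(id, pd, &qpa);

    /* Connect: CM drives the QP through INIT/RTR/RTS internally. */
    struct rdma_conn_param cp;
    memset(&cp, 0, sizeof(cp));
    rdma_connect(id, &cp);
    wait_event(ec, RDMA_CM_EVENT_ESTABLISHED);
    printf("connected\n");
    return 0;
}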
14. RDMA Support
• Work in progress:
– RDMA-CM for the control path
◦ Support multiple devices
◦ Enables a unified ceph.conf for all Ceph nodes
– Zero-copy Ceph replication
◦ Reduces the number of memcpy operations by half by re-using data buffers on the primary OSD
– Tx zero-copy
◦ Avoids the copy-out by sending from registered memory
• ToDo:
– Use RDMA READ/WRITE for better memory utilization
– ODP (on-demand paging)
– Erasure coding using hardware offload
– Performance is not yet good enough
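As a generic illustration of the RDMA READ/WRITE and registered-memory ideas mentioned above (plain libibverbs code, not Ceph's implementation; the function name and parameters are placeholders), the initiator can pull data straight from the peer's registered buffer into its own, given a remote address and rkey obtained over the control path:

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Pull 'len' bytes from a peer's registered buffer into local registered
 * memory with a one-sided RDMA READ. 'remote_addr' and 'rkey' must have
 * been exchanged over the control path beforehand. */
static int post_rdma_read(struct ibv_qp *qp, struct ibv_mr *local_mr,
                          void *local_buf, uint32_t len,
                          uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)local_buf;   /* destination: local registered buffer */
    sge.length = len;
    sge.lkey   = local_mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* ask for a completion */
    wr.wr.rdma.remote_addr = remote_addr;        /* source buffer on the peer */
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}

The target side would register its buffer with ibv_reg_mr() including IBV_ACCESS_REMOTE_READ so the peer is allowed to read it; because the NIC moves data directly between registered buffers, no receive buffers or intermediate copies are involved, which is what makes READ/WRITE attractive for memory utilization.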