CephFS Architecture Overview and Test Analysis
Yang Guanjun (杨冠军)

Agenda
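• CephFS architecture overview
  - CephFS introduction
  - Using CephFS
  - CephFS authentication
  - CephFS FSCK & Repair
• CephFS testing
  - CephFS test environment
  - CephFS test goals and tools
  - CephFS test analysis
• Summary and outlook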

CephFS Architecture Overview

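CephFS Introduction
• CephFS is the POSIX-compatible file system provided by Ceph
• Compared with RBD and RGW, it was the last Ceph feature to become production ready
• It still stores its data in RADOS underneath
• The basic functionality is ready, but many features are still experimental (as of Jewel)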

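CephFS Introduction
• Scalability
  - Clients read and write the OSDs directly
• Shared file system
  - Multiple clients can read and write at the same time
• High availability
  - MDS cluster with Active/Standby MDSs
• File/directory layouts
  - Files and directories can be given layouts that use different pools
• POSIX ACLs
  - Supported by default in the CephFS kernel client; configurable in the CephFS FUSE client
• Client quotas
  - The CephFS FUSE client supports quotas on any directory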

CephFS Architecture
• OSDs
• Monitors
• MDSs
• CephFS Kernel Object
• librados
• CephFS FUSE, CephFS Library

CephFS-related components

CephFS - MDS
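• Dynamic subtree placement
• Scheduling at directory-fragment granularity
• Traffic control
• Load balancing based on how hot each subtree is
• Clients cache the directory-to-MDS mapping
Metadata storage
• Per-MDS journals
  - Written to the OSD cluster
• Metadata
  - Written to the OSD cluster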

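How CephFS Is Used
CephFS clients:
1. CephFS kernel module
  - Available since kernel 2.6.34
2. CephFS FUSE
How a client accesses CephFS
• The client talks to an MDS node to obtain metadata (the metadata itself is also stored on the OSDs)
• The client writes data directly to the OSDs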

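CephFS Client Access Example
1. The client sends an open-file request to the MDS
2. The MDS returns the file inode, file size, capability and stripe information
3. The client reads/writes data directly from/to the OSDs
4. The MDS manages the file's capabilities
5. The client sends a close-file request to the MDS, releasing the file's capability and updating the file's details
• There is no distributed file lock
• Consistency when multiple clients access a file is ensured through the file's capabilities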

Using CephFS
• Create the MDS daemon
# ceph-deploy mds create <…>
• Create the CephFS data pool
# ceph osd pool create <…>
• Create the CephFS metadata pool
# ceph osd pool create <…>
• Create the CephFS file system
# ceph fs new <…>
• List CephFS file systems
# ceph fs ls
name: tstfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]
• Remove a CephFS file system
# ceph fs rm <fs-name> --yes-i-really-mean-it

Using CephFS
• Check the MDS status
# ceph mds stat
e8: tstfs-1/1/1 up tstfs2-0/0/1 up {[tstfs:0]=mds-daemon-1=up:active}
- e8:
  - e stands for epoch; 8 is the epoch number
- tstfs-1/1/1 up:
  - tstfs is the CephFS name
  - the three 1s are mds_map.in / mds_map.up / mds_map.max_mds
  - up is the CephFS state
- {[tstfs:0]=mds-daemon-1=up:active}:
  - [tstfs:0] is rank 0 of tstfs
  - mds-daemon-1 is the name of the MDS daemon serving tstfs
  - up:active means the MDS holding that rank is up and active
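The field breakdown above can be turned into a small parser. This is only a sketch: it assumes the Jewel-era output format shown in the sample line, and the function name `parse_mds_stat` is made up for illustration; other Ceph releases print this line differently.

```python
import re

# Sample line copied from the slide (Jewel-era format; an assumption for other releases).
SAMPLE = "e8: tstfs-1/1/1 up tstfs2-0/0/1 up {[tstfs:0]=mds-daemon-1=up:active}"


def parse_mds_stat(line):
    """Split a `ceph mds stat` line into epoch, per-filesystem counters and ranks."""
    epoch = int(re.match(r"e(\d+):", line).group(1))

    filesystems = {}
    # Each filesystem appears as "<name>-<in>/<up>/<max_mds> <state>".
    for name, n_in, n_up, n_max, state in re.findall(
        r"(\S+)-(\d+)/(\d+)/(\d+) (\w+)", line
    ):
        filesystems[name] = {
            "in": int(n_in),
            "up": int(n_up),
            "max_mds": int(n_max),
            "state": state,
        }

    ranks = {}
    # Each rank appears as "[<fs>:<rank>]=<daemon name>=<daemon state>".
    for fs, rank, daemon, state in re.findall(
        r"\[([^:\]]+):(\d+)\]=([^=]+)=([\w:]+)", line
    ):
        ranks[(fs, int(rank))] = {"daemon": daemon, "state": state}

    return {"epoch": epoch, "filesystems": filesystems, "ranks": ranks}


if __name__ == "__main__":
    print(parse_mds_stat(SAMPLE))
```

A monitoring script could use the result to check, for example, that rank 0 of `tstfs` is `up:active` before starting a test run.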

Using CephFS
• CephFS kernel client
  - # mount -t ceph <monitor ip>:6789:/ /mntdir
  - # umount /mntdir
• CephFS FUSE
  - Install the ceph-fuse package
  - # ceph-fuse -m <monitor ip>:6789 /mntdir
  - # fusermount -u /mntdir
  - On CentOS 7, where the fusermount command was not available, umount can be used instead
• Comparison
  - Performance: kernel client > ceph-fuse
  - Quota support: ceph-fuse only (client-side quotas)

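CephFS Layout and File Striping
• CephFS lets you configure the layout and striping of directories and files
• The settings are stored in the xattrs of the directory/file
  - Directory layout xattr: ceph.dir.layout
  - File layout xattr: ceph.file.layout
• Supported layout fields
  - pool: store data in the specified pool
  - namespace: store data in the specified namespace (not yet supported by rbd/rgw/cephfs)
  - stripe_unit: size of one stripe unit, in bytes
  - stripe_count: number of stripes
  - object_size: size of each RADOS object, in bytes
• By default, files and directories inherit the layout and striping of their parent directory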

CephFS Layout and File Striping
# setfattr -n ceph.dir.layout -v "stripe_unit=524288 stripe_count=8 object_size=4194304 pool=cephfs_data2" /mnt/mike512K/
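To see what stripe_unit, stripe_count and object_size mean for data placement, the sketch below maps a file offset to an object index and an offset inside that object, following the usual Ceph striping arithmetic. The function name `locate` is hypothetical, and the check that object_size is a multiple of stripe_unit reflects my understanding of the layout constraints rather than anything stated on the slide.

```python
def locate(offset, stripe_unit, stripe_count, object_size):
    """Map a file byte offset to (object number, byte offset inside that object)."""
    if object_size % stripe_unit != 0:
        raise ValueError("object_size must be a multiple of stripe_unit")

    stripes_per_object = object_size // stripe_unit
    block_no = offset // stripe_unit                    # which stripe unit of the file
    stripe_no, stripe_pos = divmod(block_no, stripe_count)
    object_set = stripe_no // stripes_per_object        # group of stripe_count objects
    object_no = object_set * stripe_count + stripe_pos
    object_off = (stripe_no % stripes_per_object) * stripe_unit + offset % stripe_unit
    return object_no, object_off


if __name__ == "__main__":
    # Layout from the setfattr example: 512 KiB units striped over 8 objects of 4 MiB.
    su, sc, osz = 524288, 8, 4194304
    for off in (0, su, 8 * su, 64 * su):
        print(off, locate(off, su, sc, osz))
```

With this layout, consecutive 512 KiB units go to eight different objects, so one large sequential IO is spread over up to eight objects; with the default layout (stripe_count=1) each 4 MiB range lands in a single object.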

CephFS Authentication
• CephFS supports client authentication, which restricts different users to different directories or to specific backend pools
# ceph auth get-or-create client.*client_name* \
    mon 'allow r' \
    mds 'allow r, allow rw path=/*specified_directory*' \
    osd 'allow rw pool=data'
• Prerequisite: authentication is enabled on the Ceph cluster
  - Configure ceph.conf
# vim /etc/ceph/ceph.conf
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

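CephFS Authentication
• Create an auth client
# ceph auth get-or-create client.tst1 mon 'allow r' mds 'allow r, allow rw path=/tst1' osd 'allow rw pool=cephfs_data'
- mon 'allow r'
  Lets the user read from the monitors; this is required
- mds 'allow r, allow rw path=/tst1'
  Lets the user read from the MDS and read/write the directory /tst1;
  'allow r' is required, otherwise the user cannot read from the MDS and mount fails with a permission error
- osd 'allow rw pool=cephfs_data'
  Lets the user read and write data in the OSD pool cephfs_data;
  without it, the user can only get file system metadata from the MDS and cannot see the contents of any file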
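When several per-directory clients are needed, the same three capability strings can be generated from a client name, a path and a pool. The sketch below only builds the argument list for the command shown above; the helper name `auth_get_or_create_argv` is hypothetical, and actually running the command (for example with `subprocess.run`) is left to the caller.

```python
import shlex


def auth_get_or_create_argv(client, path, pool):
    """Build the argv for the `ceph auth get-or-create` call shown on the slide."""
    return [
        "ceph", "auth", "get-or-create", f"client.{client}",
        "mon", "allow r",                           # required: read access to the monitors
        "mds", f"allow r, allow rw path={path}",    # read everywhere, read/write under path
        "osd", f"allow rw pool={pool}",             # data access to the CephFS data pool
    ]


if __name__ == "__main__":
    argv = auth_get_or_create_argv("tst1", "/tst1", "cephfs_data")
    # shlex.join quotes each capability string so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Passing the capabilities as separate list elements avoids shell quoting mistakes such as the typographic quotes that tend to creep into copied commands.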

CephFS Authentication
• Check the auth entry
# ceph auth get client.tst1
exported keyring for client.tst1
[client.tst1]
key = AQCd+UBZxpi4EBAAUNyBDGdZbPgfd4oUb+u41A==
caps mds = "allow r, allow rw path=/tst1"
caps mon = "allow r"
caps osd = "allow rw pool=cephfs_data"
• Mount test
# mount -t ceph <ip>:6789:/tst1 /mnt -o name=tst1,secret=AQCd+UBZxpi4EBAAUNyBDGdZbPgfd4oUb+u41A==
• Authentication is still incomplete
  - The client.tst1 above can mount the whole CephFS tree and can see and read every file in it
# mount -t ceph <ip>:6789:/ /mnt -o name=tst1,secret=AQCd+UBZxpi4EBAAUNyBDGdZbPgfd4oUb+u41A==
  - No way was found to grant read-only access to a specific directory
Only the CephFS kernel client was verified; authentication with ceph-fuse was not tried

CephFS FSCK & Repair
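• Jewel provides scrub/repair tools for CephFS
  - They can handle most kinds of metadata corruption
  - Run repair commands with great care; this is a job for an expert
  - If possible, export the metadata first as a backup
• cephfs-journal-tool
  - journal inspect/import/export/reset
  - header get/set
  - event get/apply/recover_dentries/splice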

CephFS FSCK & Repair
• Online check/scrub
- ceph tell mds.<id> damage ls
- ceph tell mds.<id> damage rm <int>
- scrub an inode and output results
# ceph daemon mds.<id> scrub_path <path> {force|recursive|repair [force|recursive|repair...]}
• Offline repair
- cephfs-data-scan init [--force-init]
- cephfs-data-scan scan_extents [--force-pool] <data pool name>
- cephfs-data-scan scan_inodes [--force-pool] [--force-corrupt] <data pool name>
- cephfs-data-scan scan_frags [--force-corrupt]
- cephfs-data-scan tmap_upgrade <metadata_pool>

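CephFS Test Goals
• Is CephFS's basic POSIX functionality complete?
• Can CephFS saturate the performance of the whole cluster?
• Is CephFS stable over long runs?
• Can CephFS cope with MDS failures?
Out of scope:
• This is not MDS parameter tuning
• This is not an MDS stress test
  - For an MDS stress test, run the MDS on a dedicated machine
  - Increase mds_cache_size
For MDS stress testing see: /XiaoxiChen3/cephfs-jewel-mds-performance-benchmark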

CephFS - Jewel
• Single Active MDS, Active-Standby MDSs
• Single CephFS within a single Ceph Cluster
• CephFS requires at least kernel 3.10.x
• CephFS - Production Ready
• Experimental Features
  - Multi Active MDSs
  - Multiple CephFS file systems within a single Ceph Cluster
  - Directory Fragmentation

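CephFS Test Environment
• Ceph cluster built on three physical machines
• Each machine has ten 4 TB 7200 RPM SATA disks plus two 480 GB SATA SSDs; each SSD is split into five 20 GB partitions that serve as journals for five OSDs
• Two 10 GbE NICs, configured as the public and cluster networks respectively
• SSD model: Intel S3500 series (its performance figures were shown on the slide and are not preserved in this text)
• Typical 7200 RPM SATA disk performance:
  - Sequential read/write: about 120 MB/s
  - IOPS: about 130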

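CephFS Test Environment
• The CephFS client runs on a separate machine, connected to Ceph over 10 GbE
• replica=3
• MDS configured as Active/Standby
• Ceph version and test machine OS: (shown on the slide; not preserved in this text)
Estimated performance of the whole Ceph cluster
[Figure: Ceph deployment architecture]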
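The estimate shown on the slide is not preserved, so the following is only a back-of-the-envelope bound built from the figures listed above (30 HDDs at about 120 MB/s, replica=3, a 10 GbE client link). It deliberately ignores the SSD journals, CPU and protocol overhead, all of which lower the achievable numbers; the function name and the simple division by the replica count are assumptions of this sketch.

```python
HOSTS = 3
HDDS_PER_HOST = 10
HDD_SEQ_MBPS = 120      # sequential MB/s per 7200 RPM SATA disk (from the slide)
REPLICAS = 3
CLIENT_LINK_GBIT = 10   # client is attached over 10 GbE


def estimate_bandwidth_mbps():
    """Rough upper bounds for client-visible sequential bandwidth, in MB/s."""
    raw = HOSTS * HDDS_PER_HOST * HDD_SEQ_MBPS   # aggregate disk bandwidth
    link = CLIENT_LINK_GBIT * 1000 / 8           # 10 Gbit/s expressed in MB/s
    return {
        "raw_disk": raw,
        # Every client write is stored REPLICAS times on the data disks.
        "write_bound": min(raw / REPLICAS, link),
        # Reads are served by one replica, so the single client link is the cap.
        "read_bound": min(raw, link),
    }


if __name__ == "__main__":
    print(estimate_bandwidth_mbps())
```

These bounds (about 1200 MB/s for writes and 1250 MB/s for reads) sit above the measured fio peaks of 810 MB/s and 1130 MB/s, which is what one would expect from an optimistic upper limit.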

CephFS Test Tools
• Functional tests: manual, fstest
• Performance tests: dd, fio, iozone, filebench
• Stability tests: fio, iozone, custom scripts
• Failure tests: manual

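CephFS Test Analysis - Functional Tests
• Methods: manual, fstest
  - Manual: mkdir/cd/touch/echo/cat/chmod/chown/mv/ln/rm, etc.
  - fstest: a simplified POSIX compliance test suite for file systems
    - It currently contains 3601 regression tests
    - System calls covered: chmod, chown, link, mkdir, mkfifo, open, rename, rmdir, symlink, truncate, unlink
• Conclusion: the functional tests passed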

CephFS Test Analysis - Performance Tests
• Methods: dd, fio, iozone, filebench
• Test categories: three CephFS stripe configurations
1. stripe_unit=1M, stripe_count=4, object_size=4M (directory: dir-1M-4-4M)
2. stripe_unit=4M, stripe_count=1, object_size=4M (directory: dir-4M-1-4M, the default)
3. stripe_unit=4M, stripe_count=4, object_size=64M (directory: dir-4M-4-64M)
• Configuring the CephFS stripe
  - Files inherit the attributes of their parent directory by default
  - Set the attr on the test directory
  Example: # setfattr -n ceph.dir.layout -v "stripe_unit=1048576 stripe_count=4 object_size=4194304" dir-1M-4-4M
Note: the client-side cache is cleared before each test round

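CephFS Test Analysis - Performance Tests - dd
• Test commands
  - Direct IO: oflag/iflag=direct
  - Sync IO: oflag/iflag=sync
  - Normal IO: no oflag/iflag specified
• Test file size: 20G
  - The test file must not be too small, so that the system cache has less effect on the result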

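CephFS Test Analysis - Performance Tests - dd
Normal IO: affected by the client cache, so throughput is high; not analysed
Direct IO: write reaches only 150 MB/s and read only 600 MB/s (caused by the IO implementation in the CephFS kernel client)
Sync IO: performance rises with bs; write reaches 550 MB/s and read 1 GB/s
Effect of the stripe mode:
1. At bs=512k/1M, IO performance is essentially the same in every stripe mode
2. At bs=4M/16M
  - With Direct IO, the stripe_unit=1M layout is slightly slower
  - With Sync IO, the stripe_unit=1M layout performs better
3. With the default file layout (orange in the chart), dd already performs well; the 64M-object stripe mode (grey in the chart) brings no obvious improvement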

CephFS Test Analysis - Performance Tests - fio
• Fixed options
-filename=tstfile      name of the test file
-size=20G              test file size of 20G
-direct=1              use direct IO
-thread                use thread mode
-name=fio-tst-name     job name
• When testing bandwidth
-ioengine=libaio/sync
-bs=512k/1M/4M/16M
-rw=write/read
-iodepth=64 -iodepth_batch=8 -iodepth_batch_complete=8
• When testing IOPS
-ioengine=libaio
-bs=4k
-runtime=300
-rw=randwrite/randread
-iodepth=64 -iodepth_batch=1 -iodepth_batch_complete=1
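The option lists above describe a small test matrix. A sketch of how the individual fio command lines could be enumerated follows; it only combines the options listed on the slide, the generator names are hypothetical, and it does not run fio itself.

```python
import itertools

# Options shared by every run, exactly as listed on the slide.
FIXED = ["-filename=tstfile", "-size=20G", "-direct=1", "-thread", "-name=fio-tst-name"]


def bandwidth_jobs():
    """Yield one fio argv per (ioengine, block size, direction) combination."""
    engines = ["libaio", "sync"]
    block_sizes = ["512k", "1M", "4M", "16M"]
    directions = ["write", "read"]
    for engine, bs, rw in itertools.product(engines, block_sizes, directions):
        yield ["fio", *FIXED, f"-ioengine={engine}", f"-bs={bs}", f"-rw={rw}",
               "-iodepth=64", "-iodepth_batch=8", "-iodepth_batch_complete=8"]


def iops_jobs():
    """Yield the two 4k random IO runs used for the IOPS measurement."""
    for rw in ["randwrite", "randread"]:
        yield ["fio", *FIXED, "-ioengine=libaio", "-bs=4k", "-runtime=300", f"-rw={rw}",
               "-iodepth=64", "-iodepth_batch=1", "-iodepth_batch_complete=1"]


if __name__ == "__main__":
    for argv in itertools.chain(bandwidth_jobs(), iops_jobs()):
        print(" ".join(argv))
```

Each of the 16 bandwidth runs and 2 IOPS runs would then be repeated in each of the three stripe directories, with caches dropped between runs.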

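Direct sync IO: limited performance, consistent with the dd results
Direct libaio: write reaches 810 MB/s and read reaches 1130 MB/s, which is the limit of the cluster
All of these are large-file tests, and the results agree with the dd tests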

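IOPS: the CephFS stripe has little effect on IOPS; about 4200 for writes and 2400 for reads
• For randread, because CephFS sits between the client and the disks, even direct IO does not necessarily hit the disks on the OSDs, since the OSDs may have the data cached.
• So the caches must be dropped on every host of the Ceph cluster before each test.
sync; echo 3 > /proc/sys/vm/drop_caches;

IO mode    Metric        dir-1M-4-4M  dir-4M-1-4M  dir-4M-4-64M
randwrite  IOPS          4791         4172         4130
randwrite  Latency (ms)  13.35        15.33        15.49
randread   IOPS          2436         2418         2261
randread   Latency (ms)  26.26        26.46        28.30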
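The IOPS and latency columns are tied together by the queue depth: with 64 requests kept in flight, the mean latency should be close to iodepth / IOPS (Little's law). The sketch below checks this against the table; applying the relation to these numbers is my own cross-check, not something stated on the slides.

```python
IODEPTH = 64  # -iodepth=64 from the fio IOPS runs

# (IOPS, reported latency in ms) per IO mode and stripe directory, from the table.
RESULTS = {
    ("randwrite", "dir-1M-4-4M"): (4791, 13.35),
    ("randwrite", "dir-4M-1-4M"): (4172, 15.33),
    ("randwrite", "dir-4M-4-64M"): (4130, 15.49),
    ("randread", "dir-1M-4-4M"): (2436, 26.26),
    ("randread", "dir-4M-1-4M"): (2418, 26.46),
    ("randread", "dir-4M-4-64M"): (2261, 28.30),
}


def expected_latency_ms(iops, iodepth=IODEPTH):
    """Mean latency implied by Little's law: requests in flight / throughput."""
    return iodepth / iops * 1000.0


if __name__ == "__main__":
    for (mode, layout), (iops, reported) in RESULTS.items():
        print(f"{mode:9s} {layout:13s} reported={reported:6.2f} ms "
              f"implied={expected_latency_ms(iops):6.2f} ms")
```

The implied and reported latencies agree to within a few hundredths of a millisecond, which suggests the latencies are dominated by queueing at depth 64 rather than by per-request service time.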

CephFS Test Analysis - Performance Tests - iozone
• Direct IO / sync IO tests, non-throughput mode
  - No threads specified; this measures single-thread iozone performance
# iozone -a -i 0 -i 1 -i 2 -n 1m -g 10G -y 128k -q 16m -I -Rb iozone-directio-output.xls
# iozone -a -i 0 -i 1 -i 2 -n 1m -g 10G -y 128k -q 16m -o -Rb iozone-syncio-output.xls
• System throughput tests, throughput mode
  - threads=16, to get the throughput of the whole system
# iozone -a -i 0 -i 1 -i 2 -r 16m -s 2G -I -t 16 -Rb iozone-directio-throughput-output.xls
# iozone -a -i 0 -i 1 -i 2 -r 16m -s 2G -o -t 16 -Rb iozone-syncio-throughput-output.xls

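• Non-throughput mode results
  - Write: 150 MB/s with direct IO, 350 MB/s with sync IO
  - Read: 560 MB/s with direct IO, 7000 MB/s with sync IO (distorted by iozone's IO pattern and the client-side cache, so this figure is not accurate)
1. Performance is essentially the same across the stripe layouts
2. For small IO on small files, dir-1M-4-4M performs slightly better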

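• Throughput mode results
1. All write variants perform about the same, peaking at about 750 MB/s, which is roughly the write limit of the cluster
2. With direct IO, read is about 1120 MB/s, the limit of the client's 10 GbE link
3. With sync IO, read reaches 22500 MB/s; this is distorted by iozone's IO pattern and the client-side cache and is not accurate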

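CephFS Test Analysis - Performance Tests - filebench
• filebench is an automated file system performance testing tool; it measures file system performance by quickly simulating the load of real application servers
• filebench ships with many predefined workloads
• Details: http://www.yangguanjun.com/2017/07/08/fs-testtool-filebench/
• For CephFS, a representative subset of the workloads is enough
  - createfiles.f / openfiles.f / makedirs.f / listdirs.f / removedirs.f
  - randomrw.f / fileserver.f / videoserver.f / webserver.f
• Conclusions
1. Apart from the read/write operations, the filebench workloads are all metadata operations and are largely unaffected by the CephFS stripe
2. Latency is low for all file operations, which is enough to meet basic file system needs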

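CephFS Test Analysis - Stability Tests
• Data read/write mode
  - Tool: fio
# fio read/write loop (sketch; END is a placeholder deadline in epoch seconds)
while [ "$(date +%s)" -lt "$END" ]; do
    fio -name=wr -filename=/mntdir/tstfile -size=10G -rw=write   # write a 10G file
    fio -name=rd -filename=/mntdir/tstfile -size=10G -rw=read    # read it back
    rm -f /mntdir/tstfile                                        # delete the file
done
• Metadata read/write mode
  - Custom script that creates directories and files at scale and writes a small amount of data to each file
# millions of files
while now < time
    create dirs
    touch files
    write little data to each file
    delete files
    delete dirs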
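The metadata loop above can be sketched in a few lines of Python. The original script is not shown in the slides, so everything here (function names, directory and file naming, default counts) is an assumption chosen for illustration; point `root` at a directory on the CephFS mount and raise the counts to reach millions of files.

```python
import os
import tempfile
import time


def metadata_round(root, n_dirs, files_per_dir, payload=b"x"):
    """Create dirs and files, write a little data, then delete everything again."""
    dirs, files = [], []
    for d in range(n_dirs):
        dir_path = os.path.join(root, f"dir-{d}")
        os.mkdir(dir_path)                       # create dirs
        dirs.append(dir_path)
        for f in range(files_per_dir):
            file_path = os.path.join(dir_path, f"file-{f}")
            with open(file_path, "wb") as fh:    # touch file
                fh.write(payload)                # write little data
            files.append(file_path)
    for file_path in files:
        os.remove(file_path)                     # delete files
    for dir_path in dirs:
        os.rmdir(dir_path)                       # delete dirs
    return len(files)


def run_for(root, seconds, n_dirs=10, files_per_dir=100):
    """Repeat metadata rounds until the time budget is used up; returns round count."""
    deadline = time.monotonic() + seconds
    rounds = 0
    while time.monotonic() < deadline:
        metadata_round(root, n_dirs, files_per_dir)
        rounds += 1
    return rounds


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:   # stand-in for a CephFS directory
        print("rounds completed:", run_for(scratch, seconds=1))
```

Running several copies in parallel from different clients puts more pressure on the MDS than a single loop does.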

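CephFS Test Analysis - Stability Tests
• Conclusions
  - After several days of continuous testing, CephFS behaved normally
  - Tests with hundreds of millions of small files exposed some problems
• Problems and fixes
  - "Behind on trimming" warnings in the log
    Adjust mds_log_max_expiring and mds_log_max_segments
  - rm fails with "No space left on device" when deleting hundreds of millions of files
    Increase mds_bal_fragment_size_max, mds_max_purge_files and mds_max_purge_ops_per_pg
  - "_send skipping beacon, heartbeat map not healthy" in the log
    Increase mds_beacon_grace, mds_session_timeout and mds_reconnect_timeout
Troubleshooting path: MDS log message -> search the relevant Ceph code -> analyse the cause -> adjust parameters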

CephFS Test Analysis - Failure Tests
• Active/standby MDS
• Single MDS
• Commands to stop/start the MDS service
# systemctl stop ceph-mds.target
# systemctl start ceph-mds.target
• Related configuration options
OPTION(mds_tick_interval, OPT_FLOAT, 5)
OPTION(mds_mon_shutdown_timeout, OPT_DOUBLE, 5)
OPTION(mds_op_complaint_time, OPT_FLOAT, 30)
• CephFS lets clients cache metadata for 30s
  - So the MDS stop/start intervals chosen for the test were 2s, 10s and 60s
• Test tool: fio
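The stop/wait/start procedure can be scripted so that the downtime is exact. The sketch below wraps the two systemctl commands from the slide; the function name, the injectable `run` and `sleep` parameters (there so the sequence can be dry-run) and the idea of running it on the MDS host while fio runs on the client are assumptions of this example.

```python
import subprocess
import time

STOP = ["systemctl", "stop", "ceph-mds.target"]
START = ["systemctl", "start", "ceph-mds.target"]


def bounce_mds(downtime_s, run=subprocess.check_call, sleep=time.sleep):
    """Stop the MDS service, keep it down for downtime_s seconds, then start it."""
    run(STOP)
    sleep(downtime_s)
    run(START)


if __name__ == "__main__":
    # Dry run: record the actions instead of touching the service.
    log = []
    for downtime in (2, 10, 60):   # the intervals used in the test
        bounce_mds(downtime, run=log.append, sleep=lambda s: log.append(("sleep", s)))
    for entry in log:
        print(entry)
```

Calling `bounce_mds(60)` with the defaults (as root on the MDS host) reproduces the 60s case, the only one of the three intervals that exceeds the 30s client metadata cache window.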

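CephFS Test Analysis - Failure Tests
• Single MDS:
  - 2s/10s: no impact
  - 60s: IO is affected
• Active/standby MDS:
  - No impact as long as the two are not stopped at the same time
  - Stopping both at once behaves like the single-MDS case
• fio results are shown in the chart on the right
  - Stopping the MDS for 60s affects IO
• Conclusions:
  - Active/standby MDS is more reliable
  - An active/standby failover does not affect metadata consistency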

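Summary
1. CephFS is production ready and meets the basic file storage needs of a production environment
2. On the CephFS kernel client side, the Linux kernel should preferably be newer than 4.5-rc1 (aio support)
3. When performance requirements are modest, consider the CephFS FUSE client, which supports quotas
4. Active/standby MDS is stable and preferable to a single-MDS setup
5. In production, run the MDS on dedicated machines and increase "mds_cache_size"
6. Avoid directories that contain a huge number of files (more than millions)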

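Summary
7. CephFS can saturate the performance of the whole Ceph cluster
8. With the default stripe mode (stripe unit=4M, stripe count=1, object size=4M), CephFS already performs well
9. For small-file workloads, try a smaller stripe unit and compare it against the default stripe
10. CephFS direct IO performance is limited; analysis shows the limit comes from the IO handling logic of the CephFS kernel client (http://www.yangguanjun.com/2017/06/26/cephfs-dd-direct-io-tst-analysis/)
11. Because of the system cache on the CephFS client, non-direct IO reads and writes show high throughput; those numbers are of little reference value
12. With the CephFS kernel client and an object size larger than 16M, a single read of more than 16M of data hangs (http://www.yangguanjun.com/2017/07/18/cephfs-io-hang-analysis/)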

Outlook - Ceph Luminous
• Ceph Luminous (v12.2.0) - next long-term stable release series
1. The new BlueStore backend for ceph-osd is now stable and the new default for newly created OSDs
2. Multiple active MDS daemons is now considered stable
3. CephFS directory fragmentation is now stable and enabled by default
4. Directory subtrees can be explicitly pinned to specific MDS daemons

Questions?
