Speeding up CEPH PG scrubs

CEPH periodically scrubs every PG (by default, deep scrubs run once a week), checking that the copies of the data held on the PG's OSDs are consistent with one another in order to keep the data safe.
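
The exact cadence is controlled by the scrub interval options. A quick way to confirm the values in effect, as a sketch assuming a release with the centralized config database (Mimic or later):

# Values are reported in seconds (86400 = 1 day, 604800 = 1 week)
ceph config get osd osd_scrub_min_interval
ceph config get osd osd_deep_scrub_interval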

After a failed disk was replaced and the data had been repaired, a large number of PGs could no longer be scrubbed in time, and a few PGs even turned up inconsistent, putting the cluster into an error state:

  cluster:
    id:     8f1c1f24-59b1-11eb-aeb6-f4b78d05bf17
    health: HEALTH_ERR
            6 scrub errors
            Possible data damage: 5 pgs inconsistent
            1497 pgs not deep-scrubbed in time
            1466 pgs not scrubbed in time

  services:
    mon: 5 daemons, quorum ceph101,ceph103,ceph107,ceph109,ceph105 (age 5d)
    mgr: ceph107.gtrmmh(active, since 9d), standbys: ceph101.qpghiy
    mds: cephfs:3 {0=cephfs.ceph106.hggsge=up:active,1=cephfs.ceph104.zhkcjt=up:active,2=cephfs.ceph102.imxzno=up:active} 1 up:standby
    osd: 360 osds: 360 up (since 5d), 360 in (since 5d)

  data:
    pools:   4 pools, 10273 pgs
    objects: 331.80M objects, 560 TiB
    usage:   1.1 PiB used, 4.2 PiB / 5.2 PiB avail
    pgs:     10261 active+clean
             7     active+clean+scrubbing+deep
             5     active+clean+inconsistent

  io:
    client:   223 MiB/s rd, 38 MiB/s wr, 80 op/s rd, 12 op/s wr
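
To pinpoint exactly which PGs are inconsistent, the health detail output lists them by ID:

ceph health detail | grep inconsistent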

At this point the PG scrub rate needs to be raised. By default a single OSD can run only one scrub operation at a time, and a scrub starts only while the host's load average is below 0.5. In the CEPH cluster I operate, each PG spans 10 OSDs; with every OSD limited to one concurrent scrub, large numbers of scrub operations sit waiting, and only about 8 scrubs typically run at once. The OSD parameters therefore need to be changed on each storage server to speed scrubbing up.

The osd_max_scrubs parameter sets the maximum number of simultaneous scrub operations on a single OSD, and osd_scrub_load_threshold sets the load average above which new scrubs are deferred. Their default values are 1 and 0.5 respectively; raising these two parameters is the main lever for increasing scrub throughput.
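
Before changing anything, the values currently in effect can be confirmed from any node; a sketch, again assuming Mimic or later (osd.0 is just an example daemon):

ceph config show osd.0 | egrep "osd_max_scrubs|osd_scrub_load_threshold"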

For example, to change these parameters on all the OSDs of host ceph101:

First, collect the IDs of all the OSDs on ceph101:

# Look up ceph101's IP address in /etc/hosts, grep the OSD dump for daemons
# bound to that address, and save their IDs (osd.N) to /tmp/osd.list
ceph osd dump | grep `grep ceph101 /etc/hosts | perl -ne 'print $1 if m/(\d\S*)/'` | perl -ne 'print "$1\n" if m/(osd.\d+)/' > /tmp/osd.list

Next, apply the new parameters to all of those OSDs in one batch:

# Inject the new values into each running OSD; the quotes keep the ceph CLI
# from trying to parse the injected options as its own arguments
for i in `cat /tmp/osd.list`
do
    ceph tell $i injectargs '--osd_max_scrubs=10 --osd_scrub_load_threshold=10'
done
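
Note that injectargs only changes the running daemons, and the values are lost when an OSD restarts. On releases with the centralized config database (Mimic and later), the same change can be made once, cluster-wide and persistently, as an alternative to the per-daemon loop above:

ceph config set osd osd_max_scrubs 10
ceph config set osd osd_scrub_load_threshold 10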

Then verify that the change took effect (run this on the storage server itself, since ceph daemon talks to the local admin socket):

for i in `cat /tmp/osd.list`
do
    echo $i
    ceph daemon $i config show | egrep "osd_max_scrubs|osd_scrub_load"
done

After applying this change on every CEPH host, the number of concurrent scrubs rose roughly 30-fold, and top shows a clear increase in CPU usage by the ceph-osd processes:

  cluster:
    id:     8f1c1f24-59b1-11eb-aeb6-f4b78d05bf17
    health: HEALTH_ERR
            6 scrub errors
            Possible data damage: 5 pgs inconsistent
            1285 pgs not deep-scrubbed in time
            1208 pgs not scrubbed in time
            1 slow ops, oldest one blocked for 37 sec, daemons [osd.124,osd.352] have slow ops.

  services:
    mon: 5 daemons, quorum ceph101,ceph103,ceph107,ceph109,ceph105 (age 5d)
    mgr: ceph107.gtrmmh(active, since 9d), standbys: ceph101.qpghiy
    mds: cephfs:3 {0=cephfs.ceph106.hggsge=up:active,1=cephfs.ceph104.zhkcjt=up:active,2=cephfs.ceph102.imxzno=up:active} 1 up:standby
    osd: 360 osds: 360 up (since 5d), 360 in (since 5d)

  data:
    pools:   4 pools, 10273 pgs
    objects: 331.80M objects, 560 TiB
    usage:   1.1 PiB used, 4.2 PiB / 5.2 PiB avail
    pgs:     10041 active+clean
             152   active+clean+scrubbing+deep
             75    active+clean+scrubbing
             3     active+clean+scrubbing+deep+inconsistent+repair
             2     active+clean+inconsistent

  io:
    client:   65 MiB/s rd, 4.2 MiB/s wr, 21 op/s rd, 2 op/s wr

Two further caveats: (1) Raising the number of parallel scrubs puts more pressure on the cluster's internal network, so 10GE or faster switches are recommended. (2) It also multiplies the number of reads running in parallel against each OSD's disk; once a disk sits at 100% utilization its effective throughput can drop, so raising osd_max_scrubs beyond 10 is not recommended.
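
A quick way to watch whether the disks are saturating while the extra scrubs run, assuming the sysstat package is installed (%util near 100 means the device is the bottleneck):

iostat -x 5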

Finally, manually trigger deep scrubs on the PGs that are already flagged in the warnings, i.e. those not deep-scrubbed within the last two weeks (the current date being 2022-08-03):

# Field 22 (0-based) of "ceph pg dump" is the PG's DEEP_SCRUB_STAMP; keep PGs
# last deep-scrubbed on 2022-07-21 or earlier, sort them oldest-first, and
# emit one "ceph pg deep-scrub" command per PG, spaced 30 seconds apart
ceph pg dump | perl -e 'while (<>) { @_ = split /\s+/; $pg{$_[0]} = $1 if ($_[22] =~ m/2022-07-(\d+)/ && $1 <= 21); } foreach ( sort {$pg{$a} <=> $pg{$b}} keys %pg ) { print "ceph pg deep-scrub $_; sleep 30;\n"; }' > for_deep_scrub.list

sh for_deep_scrub.list
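
Progress can then be followed by watching how many PGs are in a scrubbing state:

ceph -s | grep scrubbing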