[lsh@ceph2401 ~]$ sudo cephadm shell
[sudo] password for lsh:
Inferring fsid f2d0cd6e-8e43-11f0-aa90-a036bcc87e3b
Inferring config /var/lib/ceph/f2d0cd6e-8e43-11f0-aa90-a036bcc87e3b/mon.ceph2401/config
Using ceph image with id 'aade1b12b8e6' and tag 'v19' created on 2025-07-17 19:53:27 +0000 UTC
quay.io/ceph/ceph@sha256:af0c5903e901e329adabe219dfc8d0c3efc1f05102a753902f33ee16c26b6cee
[ceph: root@ceph2401 /]# ceph -s
  cluster:
    id:     f2d0cd6e-8e43-11f0-aa90-a036bcc87e3b
    health: HEALTH_WARN
            1 failed cephadm daemon(s)

  services:
    mon: 5 daemons, quorum ceph2401,ceph2402,ceph2405,ceph2403,ceph2404 (age 3M)
    mgr: ceph2402.rktinf(active, since 4M), standbys: ceph2401.vvyykk
    mds: 3/3 daemons up, 2 standby
    osd: 120 osds: 119 up (since 9h), 119 in (since 9h)

  data:
    volumes: 1/1 healthy
    pools:   7 pools, 2820 pgs
    objects: 268.74M objects, 111 TiB
    usage:   133 TiB used, 1.5 PiB / 1.6 PiB avail
    pgs:     2820 active+clean

  io:
    client: 367 MiB/s rd, 227 MiB/s wr, 5.73k op/s rd, 315 op/s wr
[ceph: root@ceph2401 /]#
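(Aside: with 119 of 120 OSDs up, the down OSD can also be named directly by filtering the OSD tree, though health detail below gets there too.)

[ceph: root@ceph2401 /]# ceph osd tree down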
OK, so an OSD is probably down (119 of 120 up and in). To get the specific error:
[ceph: root@ceph2401 /]# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s)
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon osd.76 on ceph2403 is in error state
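Health detail names both the daemon and the host, so the actual error usually lives in the systemd unit on ceph2403. Cephadm names its units ceph-<fsid>@<daemon>.service, using the fsid inferred at shell startup above:

[lsh@ceph2403 ~]$ sudo systemctl status ceph-f2d0cd6e-8e43-11f0-aa90-a036bcc87e3b@osd.76.service

(The cephadm logs --name osd.76 command used below is a wrapper around journalctl for this same unit.)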
For some reason osd.76 is so dead this time that I can't find a command to query which NVMe it's supposed to be managing (the usual lookups are sketched after the log snippet), but here's an indication:
[lsh@ceph2403 ~]$ sudo cephadm logs --name osd.76 | grep nvme
Inferring fsid f2d0cd6e-8e43-11f0-aa90-a036bcc87e3b
Apr 02 20:09:47 ceph2403 sudo[364183]: ceph : PWD=/ ; USER=root ; COMMAND=/usr/sbin/smartctl -x --json=o /dev/nvme6n2
Apr 02 20:09:48 ceph2403 sudo[364187]: ceph : PWD=/ ; USER=root ; COMMAND=/usr/sbin/nvme micron_7450_mtfdkcc15t3tfr smart-log-add --json /dev/nvme6n2
Apr 03 20:03:18 ceph2403 sudo[1648196]: ceph : PWD=/ ; USER=root ; COMMAND=/usr/sbin/smartctl -x --json=o /dev/nvme6n2
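So the daemon had been health-polling /dev/nvme6n2 (a Micron 7450, per the smart-log-add invocation): that's the drive it was managing. For reference, when an OSD is alive enough to have reported metadata, the usual lookups would be along these lines (a sketch, not verified against this dead daemon):

[ceph: root@ceph2401 /]# ceph osd metadata 76 | grep -E '"devices"|"device_paths"'
[ceph: root@ceph2401 /]# ceph device ls-by-daemon osd.76
[lsh@ceph2403 ~]$ sudo cephadm ceph-volume lvm list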
GOTO the OSD replacement KBA.
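(For reference, and assuming that KBA wraps the standard cephadm flow: replacement typically starts by draining the OSD and marking it for replacement, e.g. ceph orch osd rm 76 --replace --zap, after which swapping the physical drive lets the orchestrator redeploy onto it under the existing OSD spec.)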