So here is the problem I am facing: I have a Ceph cluster that is rebalancing, and occasionally a slow requests message shows up in the ceph -s output. I have two terminals open to the cluster. One terminal actively watches for slow requests using the following command:
watch "ceph -s | grep -i 'slow'"
As a result, I see one of two outcomes. The first looks like this:
Every 2.0s: ceph -s | grep -i 'slow' Sun Jul 12 02:17:49 2020
107 slow requests are blocked > 32 sec. Implicated osds 17,27,37,51,58,81,118,122,124,137,153,160,181,197,205,217,236,259,267,283,309,318,323,328,343
When slow requests show up like this, I need to immediately set the norecover flag on the cluster:
rbarrett@osd001:~$ sudo ceph osd set norecover
norecover is set
After that, the slow requests eventually disappear, and the cluster has to be told to continue with recovery.
Every 2.0s: ceph -s | grep -i 'slow' Sun Jul 12 02:20:07 2020
After the slow requests disappear, I have to unset the norecover option:
rbarrett@osd001:~$ sudo ceph osd unset norecover
norecover is unset
So here is my question: How can I write a script in bash to run as a process or service to do this for me?
My first thought would be to use a variable for that watch command, but then how can I set the script to run and keep an eye on the cluster?
I don't mind using Python, but I would prefer a bash script.
I was thinking of using something like this, but I don't know if it would continually run.
#!/bin/bash
check=$(ceph -s | grep -i "slow requests")
echo $check
if [[ -n $check ]];then
echo "setting norecover flag"
sudo ceph osd set norecover
else
echo "no slow requests"
sudo ceph osd unset norecover
fi
Can someone please confirm whether this will work?
You can wrap your script in a loop to make it run indefinitely.
#!/bin/bash
while : ; do
if sudo ceph -s | grep -i "slow requests"; then
echo "setting norecover flag"
sudo ceph osd set norecover
else
echo "no slow requests"
sudo ceph osd unset norecover
fi
sleep 2
done
I have added a 2-second sleep between checks so the loop does not spin at full speed, which would burn CPU and could add noticeably to the cluster load. You may want to adjust the interval to suit your needs, though I'd suggest not going below 2 seconds.
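One possible refinement: the loop above calls ceph osd set or ceph osd unset on every pass, even when nothing has changed. Below is a minimal sketch that remembers whether it set the flag and only acts when the state changes. It is not tested against a real cluster. It assumes the script runs as root (so ceph is called without sudo), and the names INTERVAL, PATTERN, flag_set and check_once are made up for illustration.

```shell
#!/bin/bash
# Sketch only. Assumptions (not from the original post): the script runs as
# root so "ceph" needs no sudo; INTERVAL, PATTERN, flag_set and check_once
# are hypothetical names chosen for illustration.

INTERVAL=2                  # seconds between checks, same as the loop above
PATTERN="slow requests"     # text to look for in "ceph -s", as in the question
flag_set=0                  # 1 once this script has set norecover

check_once() {
    local status
    # If the status query itself fails, leave the flag untouched.
    status=$(ceph -s 2>/dev/null) || return 1

    if grep -qi "$PATTERN" <<<"$status"; then
        # Slow requests present: set the flag only if we have not already.
        if (( flag_set == 0 )); then
            echo "slow requests detected, setting norecover flag"
            ceph osd set norecover && flag_set=1
        fi
    elif (( flag_set == 1 )); then
        # Slow requests gone: unset the flag only if this script set it.
        echo "slow requests cleared, unsetting norecover flag"
        ceph osd unset norecover && flag_set=0
    fi
    return 0
}

# Pass --loop to poll forever; with no argument the file only defines
# check_once, so it can be sourced and tried by hand.
if [[ "${1:-}" == "--loop" ]]; then
    while : ; do
        check_once
        sleep "$INTERVAL"
    done
fi
```

Saved under a name of your choosing, it could be started with --loop as its only argument, for example from a terminal multiplexer or a service manager. Because the flag is unset only when flag_set is 1, a norecover flag that someone set by hand is left alone.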