How to Write a Bash Script That Reacts to Ceph Slow Requests Seen in watch Output on Linux

Here is the problem I am facing: I have a Ceph cluster that is rebalancing, and occasionally a slow requests message shows up in the ceph -s output. I have two terminals open to the Ceph cluster. One of them actively watches for slow requests using the following command:

watch "ceph -s | grep -i 'slow'"

I see one of two outcomes. In the first, the output looks like this:

Every 2.0s: ceph -s | grep -i 'slow'          Sun Jul 12 02:17:49 2020

            107 slow requests are blocked > 32 sec. Implicated osds 17,27,37,51,58,81,118,122,124,137,153,160,181,197,205,217,236,259,267,283,309,318,323,328,343

As soon as I see slow requests appear, I need to set the norecover flag on the cluster:

rbarrett@osd001:~$ sudo ceph osd set norecover
norecover is set

After that, the slow requests eventually disappear, which is the second outcome: the watch output is empty.

Every 2.0s: ceph -s | grep -i 'slow'          Sun Jul 12 02:20:07 2020

Once the slow requests are gone, I have to unset the norecover flag so the cluster can continue with recovery:

rbarrett@osd001:~$ sudo ceph osd unset norecover
norecover is unset

So here is my question: How can I write a script in bash to run as a process or service to do this for me?

My first thought would be to use a variable for that watch command, but then how can I set the script to run and keep an eye on the cluster?

I don't mind using Python, but I would prefer a bash script.

I was thinking of using something like this, but I don't know if it would continually run.

#!/bin/bash
check=$(ceph -s | grep -i "slow requests")
echo "$check"
if [[ -n $check ]]; then
  echo "setting norecover flag"
  sudo ceph osd set norecover
else
  echo "no slow requests"
  sudo ceph osd unset norecover
fi

Can someone please confirm whether this will work?

Solution

As written, your script checks once and exits. You can wrap it in an infinite loop so that it keeps running:

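#!/bin/bash

# Poll the cluster status forever.
while : ; do
    # grep exits 0 (and prints the matching line) when slow requests are reported.
    if sudo ceph -s | grep -i "slow requests"; then
        echo "setting norecover flag"
        sudo ceph osd set norecover
    else
        echo "no slow requests"
        sudo ceph osd unset norecover
    fi

    # Pause between checks.
    sleep 2
done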
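
One possible refinement: the loop above runs ceph osd unset norecover on every pass while the cluster is healthy. The sketch below only touches the flag when the slow-request state changes. The function names (ceph_cmd, has_slow_requests, decide_action, monitor) and the "start" argument are hypothetical and chosen for illustration; the ceph commands are the same ones used in the question. Treat it as a starting point under those assumptions, not a tested production script.

```shell
#!/bin/bash
# Sketch: toggle norecover only when the slow-request state changes.
# All function names and the "start" argument are illustrative assumptions.

# Wrapper around the ceph CLI so the sudo call lives in one place.
ceph_cmd() {
    sudo ceph "$@"
}

# Succeeds if the status text read from stdin mentions slow requests.
has_slow_requests() {
    grep -qi "slow requests"
}

# Prints the action needed to go from state $1 to state $2
# (states are "slow" or "ok"); prints "none" if nothing changed.
decide_action() {
    local previous=$1 current=$2
    if [[ $previous == "$current" ]]; then
        echo none
    elif [[ $current == slow ]]; then
        echo set
    else
        echo unset
    fi
}

# Poll forever and act only on state changes.
monitor() {
    local previous=unknown current action
    while : ; do
        if ceph_cmd -s | has_slow_requests; then
            current=slow
        else
            current=ok
        fi
        action=$(decide_action "$previous" "$current")
        if [[ $action != none ]]; then
            echo "$(date '+%F %T') state=$current: ceph osd $action norecover"
            ceph_cmd osd "$action" norecover
        fi
        previous=$current
        sleep 2
    done
}

# Start polling only when called with the (hypothetical) "start" argument.
if [[ ${1:-} == start ]]; then
    monitor
fi
```

Saved under a name of your choice, it would be started with the start argument. Note that on the first pass the previous state is unknown, so the script sets or unsets the flag once at startup, just as the simple loop would.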

I added a 2-second sleep between checks so that the script does not spin in a tight loop, which would burn CPU and could add significant load to the cluster. Adjust the interval to suit your needs, but I would suggest not going below 2 seconds.
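
Since the question also asks about running this as a service, here is a rough sketch of one way to do that with systemd. It assumes the host uses systemd and that the looping script has been saved as /usr/local/bin/ceph-norecover-watch.sh; that path, the unit name, and the write_unit helper are all hypothetical. A systemd service runs as root by default, so the sudo calls inside the script would be redundant there, though harmless where sudo is installed.

```shell
#!/bin/bash
# Sketch: generate a minimal systemd unit for the watcher script.
# SCRIPT_PATH and the unit name below are assumptions, not from the original post.
SCRIPT_PATH=${SCRIPT_PATH:-/usr/local/bin/ceph-norecover-watch.sh}

# Writes the unit file to the path given as $1.
write_unit() {
    cat > "$1" <<EOF
[Unit]
Description=Toggle Ceph norecover flag while slow requests are reported
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=$SCRIPT_PATH
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
}

# Example installation (run as root):
#   write_unit /etc/systemd/system/ceph-norecover-watch.service
#   systemctl daemon-reload
#   systemctl enable --now ceph-norecover-watch.service
```

With Restart=on-failure, systemd restarts the script if it exits with an error, and journalctl -u ceph-norecover-watch.service shows the messages the script echoes.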