
SGE Cluster qsub email notifications not working


I'm working on an SGE cluster and having problems with the qsub email notification system. All of my jobs run perfectly, but I can't seem to change the default behaviour of notifying only when a job is aborted. The -M flag works correctly, and I do receive an email when a job is aborted. However, I would like to get an email when a job begins, ends, is aborted, or is suspended. I am using the following flags (and more) in my scripts. Is there something obvious that I am missing?

#!/bin/bash
#$ -S /bin/bash
#$ -M email@server
#$ -m beas

program

It also does not work when I try the following:

qsub -M email@server -m baes script.sh

Is this an issue that I should take up with my cluster sys admins, or have I done something incorrectly?

Thanks for your help.


Solution

  • The important thing to understand in solving this problem is that your job status email is sent by the node where the job runs, not by the node you submit from. For example, I have a test job consisting of the following script:

    #!/bin/bash
    #
    #$ -N MAIL
    #$ -j y
    #$ -m easb
    #$ -M pkenyon
    
    hostname
    

    Now, run the job, and see where it ran.

    [pkenyon@head ~]$ qsub mail.sh
    Your job 346 ("MAIL") has been submitted
    [pkenyon@head ~]$ cat MAIL.o346
    node03.cluster
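
    If the job does not print its own hostname, the execution host can usually be recovered from the accounting record after the job finishes. The following is a minimal sketch, assuming accounting is enabled on the cluster and that `qacct -j` prints a `hostname` field, as it normally does. The function name `exec_host_from_qacct` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: read `qacct -j <job_id>` output on stdin and
# print the value of the first "hostname" field, i.e. the node the job ran on.
exec_host_from_qacct() {
    awk '$1 == "hostname" { print $2; exit }'
}

# Usage on a cluster (346 is the job ID from the example above):
#   qacct -j 346 | exec_host_from_qacct
```

    That node is the one whose mail logs you need to read.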
    

    If you look at the mail logs on the node where the job ran, you'll see the delivery attempts that were made, and you can diagnose the problem from there. Here are a few examples of failures (and of deliveries that succeed, but not in the way you want):

    • Sent to the compute node address, using -M pkenyon

      ...
      Jun  5 13:56:00 node04 postfix/local[13141]: 14A3E143320: to=<[email protected]>, orig_to=<pkenyon>, relay=local, delay=0.05, delays=0.03/0/0/0.01, dsn=2.0.0, status=sent (delivered to mailbox)
      ...
      
    • Head node MX not set up right, using -M [email protected]

      ...
      Jun  5 14:00:30 node04 postfix/smtp[13283]: 35CC4143320: to=<[email protected]>, relay=none, delay=0.36, delays=0.17/0/0.19/0, dsn=5.4.4, status=bounced (Host or domain name not found. Name service error for name=head.cluster type=AAAA: Host not found)
      ...
      
    • You need to set up your system to use a local mail relay if using -M [email protected]

      ...
      Jun  5 12:20:47 node04 postfix/smtp[12798]: 1EEA5143320: to=<[email protected]>, relay=ASPMX.L.GOOGLE.com[64.233.168.27]:25, delay=0.64, delays=0.04/0/0.59/0.02, dsn=5.0.0, status=bounced (host ASPMX.L.GOOGLE.com[64.233.168.27] said: 550 Relay not permitted (in reply to RCPT TO command))
      ...
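
    To scan many such entries at once, the log lines can be reduced to the fields that matter. This is a rough sketch, assuming a Postfix log in the format shown in the excerpts above. The function name `summarize_deliveries` is hypothetical, and the log path varies by distribution (often /var/log/maillog or /var/log/mail.log), so it is passed in as an argument.

```shell
#!/bin/sh
# Hypothetical helper: print "recipient relay dsn status" for every
# delivery attempt recorded in a Postfix log file.
# $1: path to the mail log (location is system dependent).
summarize_deliveries() {
    # Only delivery-attempt lines carry a status= field.
    grep ' status=' "$1" |
    sed -E 's/.* to=<([^>]*)>.* relay=([^,]*),.* dsn=([^,]*), status=([a-z]+).*/\1 \2 \3 \4/'
}

# Usage on the compute node (reading the log usually requires root):
#   summarize_deliveries /var/log/maillog
```

    A dsn beginning with 2 means the message was accepted by the relay listed, and one beginning with 5 is a permanent failure, as in the two bounced examples above.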
      

    So yes, you need to talk to your cluster sysadmins, but these are the first steps to figuring out where your SGE emails are hanging up. With a little more information, your admins will be able to fix the configuration issue and help you get more out of your cluster environment.