SGE submitted job state doesn't change from "qw"


I'm using Sun Grid Engine (SGE) on Ubuntu 14.04 to queue jobs to run on a multicore CPU. I've installed and set up SGE on my system. I created a "hello_world" directory containing two shell scripts, "hello_world.sh" and "hello_world_qsub.sh". The first contains a simple command; the second uses qsub to submit the first script as a job. Here's what "hello_world.sh" contains:

#!/bin/bash

echo "Hello world" > /home/theodore/tmp/hello_world/hello_world_output.txt

And here's what "hello_world_qsub.sh" contains:

#!/bin/bash

qsub \
  -e /home/hello_world/hello_world_qsub.error \
  -o /home/hello_world/hello_world_qsub.log \
  ./hello_world.sh

After making the second script executable and running it with "./hello_world_qsub.sh" from that directory, the output looks reasonable:

Your job 1 ("hello_world.sh") has been submitted

But the output of "qstat" command is frustrating:

    job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
    -----------------------------------------------------------------------------------------------------------------
     1 0.50000 hello_worl mhr          qw    05/16/2016 20:26:23                                    1        

The "state" column stays at "qw" and never changes to "r".

Here's the output of "qstat -j 1" command:

==============================================================
job_number:                 1
exec_file:                  job_scripts/1
submission_time:            Mon May 16 20:26:23 2016
owner:                      mhr
uid:                        1000
group:                      mhr
gid:                        1000
sge_o_home:                 /home/mhr
sge_o_log_name:             mhr
sge_o_path:                 /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
sge_o_shell:                /bin/bash
sge_o_workdir:              /home/mhr/hello_world
sge_o_host:                 localhost
account:                    sge
stderr_path_list:           NONE:NONE:/home/hello_world/hello_world_qsub.error
mail_list:                  mhr@localhost
notify:                     FALSE
job_name:                   hello_world.sh
stdout_path_list:           NONE:NONE:/home/hello_world/hello_world_qsub.log
jobshare:                   0
env_list:                   
script_file:                ./hello_world.sh
scheduling info:            queue instance "mainqueue@localhost" dropped because it is temporarily not available
                        All queues dropped because of overload or full

And here's the output of "qhost" command:

HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
localhost               -               -     -       -       -       -       -

What should I do to make my jobs run and finish their task?


Solution

  • From your qhost output, it looks like your machine "localhost" is known to SGE as an execution host. However, sge_execd on "localhost" is either not running or not configured properly. If it were working, qhost would report statistics (architecture, CPU count, load, memory) for "localhost" instead of dashes.
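
The check described above can be sketched as a small filter over qhost output. This is only an illustration: the function name hosts_without_stats is hypothetical, and it assumes the column layout shown in the question (two header lines, then one row per host, with "-" in the NCPU and LOAD columns when nothing is reported).

```shell
#!/bin/bash

# Hypothetical helper: read qhost-style output on stdin and print every
# host whose NCPU and LOAD columns are "-", i.e. hosts that report no
# statistics. The first two lines (header and separator) are skipped, and
# so is the "global" pseudo-host, which always shows dashes.
hosts_without_stats() {
  awk 'NR > 2 && $1 != "global" && $3 == "-" && $4 == "-" { print $1 }'
}

# Sample input copied from the qhost output in the question.
sample='HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
localhost               -               -     -       -       -       -       -'

printf '%s\n' "$sample" | hosts_without_stats
```

On a live system the same function could be fed directly, for example with "qhost | hosts_without_stats". If a host is listed, a reasonable next step is to check whether an sge_execd process exists on it (for instance with "pgrep -l sge_execd") and to start it if not. How the daemon is started depends on how SGE was installed; on Ubuntu's packaged version it is typically managed through the gridengine-exec service, but treat that name as an assumption to verify on your machine.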