
Why doesn't Torque qsub create an output file?


I am trying to start a job on a cluster via Torque PBS with the command

qsub -o a.txt a.sh

The file a.sh contains a single line:

hostname

After running qsub, I run qstat, which gives the following output:

Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
302937.voms               a.sh             user            00:00:00 E long

After 5 seconds, qstat returns empty output (no jobs in the queue). The command

qsub --version

gives the output: version: 2.5.13

The command

which qsub

gives the output: /usr/bin/qsub

The problem is that the file a.txt (from the command qsub -o a.txt a.sh) is not created! The terminal shows only the job ID, with no errors. The command

 qsub a.sh

behaves the same way. How can I fix it? Where are the qsub log files with errors?

If I use the command

qsub -l nodes=node36:ppn=1 -o a.txt a.sh

then I can find the output files in the folder

/var/spool/pbs/undelivered

on node36 (after logging in to it via ssh). The output file contains the string "node36", and the error file is empty. Why are my files "undelivered"?


Solution

  • The output and error log files are kept in a spool directory on the execution node and copied back to the head node after the job has completed. The location of the spool directory varies between installations, but look for it under /var/torque/spool on the first node in the list of nodes allocated to the job.

    Several things can cause Torque to fail to deliver the output files:

    1. The user submitting the job might not exist on the node, their home directory might not be accessible, or there might be a user ID mismatch between the nodes of the cluster.
    2. Torque uses ssh to copy files to the head node, but passwordless public-key authentication for the user has not been set up consistently across all the nodes of the cluster.
    3. A node failed during the execution of the job.

    This list is by no means complete. There are already a number of questions on Stack Overflow dealing with this kind of failure. Check whether any of the above applies to your case.
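The first two causes can be checked by hand. Below is a minimal sketch of such a check, meant to be run as the submitting user on the execution node (for example after `ssh node36`), since the files are copied from there back to the head node. The function names and the host name `headnode` are hypothetical placeholders; only `ssh`, `id` and `test` are real commands here, and nothing in the sketch is specific to Torque.

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration.

# Succeeds if key-based (passwordless) ssh to host $1 works.
# BatchMode=yes makes ssh fail instead of prompting for a password.
can_ssh_without_password() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true 2>/dev/null
}

# Compare two numeric user IDs and print a verdict.
compare_uids() {
    if [ "$1" = "$2" ]; then
        echo "uid match ($1)"
    else
        echo "uid MISMATCH (local $1, remote $2)"
    fi
}

# Run on the execution node; $1 is the head node's host name.
check_delivery_path() {
    head=$1

    # Cause 1: the home directory must exist and be writable here.
    if [ -d "$HOME" ] && [ -w "$HOME" ]; then
        echo "home directory accessible: $HOME"
    else
        echo "home directory NOT accessible: $HOME"
    fi

    # Cause 2: copying output back needs passwordless ssh to the head node.
    if ! can_ssh_without_password "$head"; then
        echo "passwordless ssh to $head failed"
        return 1
    fi

    # Cause 1 again: the numeric user ID should be the same on both hosts.
    remote_uid=$(ssh -o BatchMode=yes "$head" id -u)
    compare_uids "$(id -u)" "$remote_uid"
}

# Usage (hypothetical host name):
#   check_delivery_path headnode
```

If the ssh check fails, the key setup between the nodes is the first thing to look at. A mismatch message points to inconsistent user accounts across the cluster. Passing all three checks does not rule out other causes, such as a node failing during the job.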