
Slurm sbatch does not save all system output when a job fails


I am running a job that requires a large amount of memory on a cluster using Slurm. I used the --output flag to save the system output. This works if the job finishes without error. However, if the job runs out of memory on the node, any system output produced before the error does not appear in output.log, so the file contains only the output from after the point where the error occurred.

Is there a way for Slurm to save all system output to output.log when a job fails, so that I can see at which point in the job the error occurred?

Here is the batch script I am using:

#!/bin/bash -l
#SBATCH --account=qmech
#SBATCH --job-name=job
#SBATCH --exclusive
#SBATCH -C mem768
#SBATCH --mem=750gb
#SBATCH -c 32 # CPU per task
#SBATCH --time=01:00:00
#SBATCH --output=output.log
#SBATCH --error=error.log

I have looked at the Slurm documentation but could not find a parameter that achieves this.


Solution

  • sbatch does not truncate output files when a job fails. If your script and the programs it calls write regular output to standard output and errors and warnings to standard error, the regular output should end up in output.log and the errors and warnings in error.log, as specified in your #SBATCH directives. Depending on the amount of log data, the logs may be easier to read if you do not specify a separate error file with --error. Both streams are then written to the output file, so regular output and warnings/errors appear in the order they occurred.
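
As a rough illustration of the single-log suggestion, here is a minimal sketch of a batch script that sets --output but leaves out --error, so that standard error goes to the same file as standard output. The run_step function is a hypothetical stand-in for the real workload, and the cluster-specific directives from the question (--account, -C, --mem, -c) are omitted for brevity; add them back as needed.

```shell
#!/bin/bash -l
#SBATCH --job-name=job
#SBATCH --time=00:01:00
#SBATCH --output=output.log
# No --error directive here: when only --output is given, sbatch sends
# standard error to the same file as standard output.

# Hypothetical stand-in for the real workload (an assumption for
# illustration, not part of the original script).
run_step() {
    echo "step $1: started"        # regular output goes to stdout
    echo "step $1: warning" >&2    # warnings and errors go to stderr
}

run_step 1
run_step 2
```

Submitted with sbatch, this should leave the "started" and "warning" lines interleaved in output.log in the order they were written. Because the #SBATCH lines are ordinary comments to bash, the script can also be run directly to check the messages before submitting it.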