Search code examples
spring-batch

Spring Batch: How to set status FAILED to a JOB that is in status STARTED but is not running


I have scenarios where jobs just stay in status "STARTED" but the server crashed in the mean time. I found the following instruction in the Spring Batch Docu:

https://docs.spring.io/spring-batch/docs/current/reference/html/index-single.html#stoppingAJob

If the process died (kill -9 or server failure), the job is, of course, not running, but the JobRepository has no way of knowing because no one told it before the process died. You have to tell it manually that you know that the execution either failed or should be considered aborted (change its status to FAILED or ABANDONED). This is a business decision, and there is no way to automate it. Change the status to FAILED only if it is restartable and you know that the restart data is valid.

The JobOperator https://docs.spring.io/spring-batch/docs/current/api/org/springframework/batch/core/launch/JobOperator.html has only a function for "abandon" and "stop" but as Stop can take for ever. I do not understand how to set a Job to status "FAILED" as describe in the text above.


Solution

  • I have a cleanup task that runs when my server starts up, to find any batch jobs that were previously running on the same server that are still in running status and changes the status to FAILED. You might be able to do something similar to this. When I launch a job (or restart it), I add the server_name job parameter so I can identify which server is running a job.

    @Configuration
    public class SpringBatchCleanupTask implements CommandLineRunner {
    
        private static final Logger LOGGER = LoggerFactory.getLogger(SpringBatchCleanupTask .class);
    
        @Autowired
        private JobRegistry jobRegistry;
    
        @Autowired
        private JobExplorer explorer;
    
        @Autowired
        private JobRepository jobRepository;
    
        @Autowired
        private SpringBatchJobLauncher jobLauncher;
    
        @Value("${spring.batch.server-name}")
        private String serverName;
    
        @Override
        public void run(String... args) throws Exception {
            LOGGER.info("Starting spring batch cleanup on {}...", serverName);
    
            for (String jobName : jobRegistry.getJobNames()) {
    
                Set<JobExecution> executions = explorer.findRunningJobExecutions(jobName);
    
                for (JobExecution jobExecution : executions) {
                    // stale job - force cleanup by setting end date and status
                    if (isRunningOnThisServer(jobExecution)) {
                        LOGGER.warn("failing job execution id: " + jobExecution.getId());
                        Date now = new Date();
                        ExitStatus cleanupTaskExitStatus = new ExitStatus("SERVER_RESTARTED");
    
                        jobExecution.setStatus(BatchStatus.FAILED);
                        jobExecution.setExitStatus(cleanupTaskExitStatus);
                        jobExecution.setEndTime(now);
                        jobRepository.update(jobExecution);
    
                        for (StepExecution stepExecution : jobExecution.getStepExecutions()) {
                            if (isRunning(stepExecution)) {
                                stepExecution.setStatus(BatchStatus.FAILED);
                                stepExecution.setExitStatus(cleanupTaskExitStatus);
                                stepExecution.setEndTime(now );
                                jobRepository.update(stepExecution);
                            }
                        }
                    }
                }
            }
        }
    
        private boolean isRunningOnThisServer(JobExecution jobExecution) {
            if (jobExecution.isRunning()) {
                String serverNameParam = jobExecution.getJobParameters()
                        .getString("server_name");
                if (serverNameParam  != null && serverNameParam.equals(serverName)) {
                    return true;
                }
            }
    
            return false;
        }
    
        private boolean isRunning(StepExecution stepExecution) {
            switch (stepExecution.getStatus()) {
            case STARTED:
            case STARTING:
            case STOPPING:
            case UNKNOWN:
                return true;
    
            default:
                return stepExecution.getEndTime() == null;
            }
        }
    }