I have scenarios where jobs just stay in status "STARTED" because the server crashed in the meantime. I found the following instruction in the Spring Batch documentation:
https://docs.spring.io/spring-batch/docs/current/reference/html/index-single.html#stoppingAJob
If the process died (kill -9 or server failure), the job is, of course, not running, but the JobRepository has no way of knowing because no one told it before the process died. You have to tell it manually that you know that the execution either failed or should be considered aborted (change its status to FAILED or ABANDONED). This is a business decision, and there is no way to automate it. Change the status to FAILED only if it is restartable and you know that the restart data is valid.
The JobOperator (https://docs.spring.io/spring-batch/docs/current/api/org/springframework/batch/core/launch/JobOperator.html) only offers "abandon" and "stop" operations, and stopping can take forever. I do not understand how to set a job to status "FAILED" as described in the text above.
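From the JobExplorer and JobRepository APIs I would guess something like the following, but I am not sure whether updating the repository directly is the intended way (just a sketch; staleExecutionId stands for an execution id I already looked up):

JobExecution execution = jobExplorer.getJobExecution(staleExecutionId);
execution.setStatus(BatchStatus.FAILED);
execution.setExitStatus(ExitStatus.FAILED);
execution.setEndTime(new Date()); // mark the execution as finished
jobRepository.update(execution);  // persist the new status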
I have a cleanup task that runs when my server starts up. It finds any batch jobs that were previously running on the same server and are still in a running status, and changes their status to FAILED. You might be able to do something similar. When I launch a job (or restart it), I add a server_name job parameter so I can identify which server is running a job; see the launch sketch after the cleanup task below.
import java.util.Date;
import java.util.Set;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.configuration.JobRegistry;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.CommandLineRunner;
import org.springframework.stereotype.Component;

@Component
public class SpringBatchCleanupTask implements CommandLineRunner {

    private static final Logger LOGGER = LoggerFactory.getLogger(SpringBatchCleanupTask.class);

    @Autowired
    private JobRegistry jobRegistry;

    @Autowired
    private JobExplorer explorer;

    @Autowired
    private JobRepository jobRepository;

    @Value("${spring.batch.server-name}")
    private String serverName;

    @Override
    public void run(String... args) throws Exception {
        LOGGER.info("Starting spring batch cleanup on {}...", serverName);
        for (String jobName : jobRegistry.getJobNames()) {
            Set<JobExecution> executions = explorer.findRunningJobExecutions(jobName);
            for (JobExecution jobExecution : executions) {
                // stale job - force cleanup by setting end date and status
                if (isRunningOnThisServer(jobExecution)) {
                    LOGGER.warn("failing job execution id: {}", jobExecution.getId());
                    Date now = new Date();
                    ExitStatus cleanupTaskExitStatus = new ExitStatus("SERVER_RESTARTED");
                    jobExecution.setStatus(BatchStatus.FAILED);
                    jobExecution.setExitStatus(cleanupTaskExitStatus);
                    jobExecution.setEndTime(now);
                    jobRepository.update(jobExecution);
                    // fail the still-running steps as well, otherwise they stay STARTED
                    for (StepExecution stepExecution : jobExecution.getStepExecutions()) {
                        if (isRunning(stepExecution)) {
                            stepExecution.setStatus(BatchStatus.FAILED);
                            stepExecution.setExitStatus(cleanupTaskExitStatus);
                            stepExecution.setEndTime(now);
                            jobRepository.update(stepExecution);
                        }
                    }
                }
            }
        }
    }

    // only clean up executions that this server started, matched via the
    // server_name job parameter added at launch time
    private boolean isRunningOnThisServer(JobExecution jobExecution) {
        if (jobExecution.isRunning()) {
            String serverNameParam = jobExecution.getJobParameters().getString("server_name");
            return serverNameParam != null && serverNameParam.equals(serverName);
        }
        return false;
    }

    // a step counts as running if it is in a non-terminal status or has no end time yet
    private boolean isRunning(StepExecution stepExecution) {
        switch (stepExecution.getStatus()) {
            case STARTED:
            case STARTING:
            case STOPPING:
            case UNKNOWN:
                return true;
            default:
                return stepExecution.getEndTime() == null;
        }
    }
}
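For reference, the launch side looks roughly like this (a simplified sketch; the ServerAwareJobLauncher name and the run.id parameter are illustrative, not taken from my actual code):

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

@Component
public class ServerAwareJobLauncher {

    @Autowired
    private JobLauncher jobLauncher;

    @Value("${spring.batch.server-name}")
    private String serverName;

    public JobExecution launch(Job job) throws Exception {
        JobParameters params = new JobParametersBuilder()
                // records which server started this execution, so the
                // cleanup task only fails executions it owns
                .addString("server_name", serverName)
                // unique value so every launch creates a new JobInstance
                .addLong("run.id", System.currentTimeMillis())
                .toJobParameters();
        return jobLauncher.run(job, params);
    }
}

The run.id parameter just makes each launch a new JobInstance; the important part is that server_name matches the ${spring.batch.server-name} value the cleanup task checks.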