spring-batch, partitioning

What is the best way to store job level data from Spring batch partition worker steps?


I have a Spring Batch job that processes a large number of items. For each item, it calls an external service (assume a stored procedure or a REST service) that performs some business calculations and updates a database; those results are used to generate analytical reports. Each item is independent, so I partition the external calls into 10 partitions within the same JVM. For example, if there are 50 items to process, each partition handles 50/10 = 5 items. The external service returns a SUCCESS or FAILURE code. All the business logic is encapsulated in this external service, so each worker step is a tasklet that simply calls the service and receives the SUCCESS/FAILURE flag. I want to store the SUCCESS/FAILURE flag for each item and retrieve them all when the job is over. These are the approaches I can think of:

  1. Each worker step can store the item and its SUCCESS/FAILURE flag in a collection held in the job execution context. Spring Batch persists the execution context, and I can retrieve it at the end of the job. This is the most naïve approach, and it causes thread contention when all 10 worker steps try to access and modify the same collection.
  2. The concurrency exceptions in the first approach can be avoided by using a concurrent collection such as CopyOnWriteArrayList. But copy-on-write is too costly for frequent writes, and the whole purpose of partitioning is defeated when each worker step has to wait to access the list.
  3. I can write the item ID and success/failure status to an external table or message queue. This avoids the issues in the two approaches above, but we are going outside the Spring Batch framework to achieve it; that is, we are no longer using the Spring Batch job execution context but an external database or message queue instead.
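Approach 2 can be sketched with plain JDK concurrency (no Spring Batch on the classpath). The `ItemStatus` record, the partition/item counts, and `callExternalService` are hypothetical stand-ins for the real tasklet and service call; the comment on the shared list states why it is expensive:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PartitionStatusDemo {

    // Hypothetical result record: item ID plus its SUCCESS/FAILURE flag.
    public record ItemStatus(long itemId, boolean success) {}

    public static List<ItemStatus> collectStatuses() throws InterruptedException {
        int partitions = 10;
        int itemsPerPartition = 5; // 50 items / 10 partitions

        // Shared collection from approach 2. CopyOnWriteArrayList copies the
        // whole backing array on every add, which is exactly what makes it
        // costly when all 10 workers are writing to it.
        List<ItemStatus> statuses = new CopyOnWriteArrayList<>();

        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        for (int p = 0; p < partitions; p++) {
            final int partition = p;
            pool.submit(() -> {
                for (int i = 0; i < itemsPerPartition; i++) {
                    long itemId = (long) partition * itemsPerPartition + i;
                    boolean success = callExternalService(itemId); // stands in for the SP/REST call
                    statuses.add(new ItemStatus(itemId, success));
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return statuses;
    }

    static boolean callExternalService(long itemId) {
        return itemId % 7 != 0; // pretend some items fail
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("collected=" + collectStatuses().size());
    }
}
```

The sketch is thread-safe, but every `add` pays for a full array copy, which is the cost the approach description refers to.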

Are there any better ways to do this?


Solution

  • You still have not answered the question about which item writer you are going to use, so I will try to answer your question and show you why this detail is key to choosing the right solution to your problem.

    Here is your requirement:

    I have a spring batch job. It processes a large number of items.
    For each item, it calls an external service (assume a stored procedure
    or a REST service. This does some business calculations and updates a database.
    

    In your description, you are talking about storing item IDs with their status in the job execution context. While this is possible, my point is that if you are going to write items to a table anyway, and that table has a status column, then you don't need to use the job execution context at all. Hence my question:

    are you going to write the items themselves to a persistent store?
    The item writer is required in a chunk-oriented step, and the solution
    depends on how you are going to write items (also, is the success/failure
    status just a flag, or a richer object with more information?).
    Where are those items going to be written? A table, a file, the standard
    output with System.out?
    
    

    So I will assume you are going to write items to a table having a status column, since you said "This does some business calculations and updates a database".

    You can use an item processor to execute the business logic and flag each item with its status (i.e. your domain object has a status field that the processor sets as needed). The item writer then updates the items in the database with their status. This approach solves all the issues listed above by design: it does not require the job execution context, and it is a good fit for a multi-threaded or partitioned step (since items are independent).
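A minimal sketch of that processor/writer split, kept free of the Spring Batch dependency so it stands alone: `Item`, its `status` field, and `callExternalService` are hypothetical names, and the comments note where the real framework pieces (`ItemProcessor<Item, Item>`, a `JdbcBatchItemWriter`) would plug in.

```java
import java.util.List;

public class StatusProcessingSketch {

    // Hypothetical domain object; the status field is what the processor fills in.
    public static class Item {
        final long id;
        String status; // "SUCCESS" or "FAILURE", set by the processor
        public Item(long id) { this.id = id; }
    }

    // In a real job this class would implement Spring Batch's
    // ItemProcessor<Item, Item>; process() has the same contract.
    public static class StatusProcessor {
        public Item process(Item item) {
            boolean ok = callExternalService(item.id); // the SP/REST call
            item.status = ok ? "SUCCESS" : "FAILURE";
            return item;
        }
        private boolean callExternalService(long id) {
            return id % 2 == 0; // placeholder for the real business call
        }
    }

    // In a real job the writer could be e.g. a JdbcBatchItemWriter issuing
    // an UPDATE of the status column per item, batched per chunk. Here it
    // just prints the statement it would run.
    public static class StatusWriter {
        public void write(List<Item> chunk) {
            for (Item item : chunk) {
                System.out.println(
                    "UPDATE items SET status='" + item.status + "' WHERE id=" + item.id);
            }
        }
    }

    public static void main(String[] args) {
        StatusProcessor processor = new StatusProcessor();
        List<Item> chunk = List.of(new Item(1), new Item(2));
        chunk.forEach(processor::process);
        new StatusWriter().write(chunk);
    }
}
```

Because each partition processes and writes its own items, no shared in-memory collection is needed; reading the status column afterwards replaces reading the job execution context.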