Tags: spring, spring-batch

Spring Batch: processor changing the state used in the chunk select affects the number of processed items


I have a situation where the ItemReader selects items from a table by their state and the processor changes that state (from READY to PROCESSED). The processed items are (of course) no longer matched by the next chunk's select, with the result that some items are never processed at all.

The first chunk selects and processes items with ids 1-10, correctly setting their state to PROCESSED:

SELECT t.* FROM the_table t WHERE t.status = 'READY' OFFSET 0 FETCH NEXT 10 ROWS ONLY;

The next chunk tries to select ids 11-20 with:

SELECT t.* FROM the_table t WHERE t.status = 'READY' OFFSET 10 FETCH NEXT 10 ROWS ONLY;

but since ids 1-10 no longer match the select (their state was changed by the first chunk), ids 21-30 are returned instead. Items 11-20 are never processed.

This is confirmed by the fact that setting the chunk size large enough to cover all potential READY items produces the desired result (the writer saves all items at once). But that is not a viable solution for production, since a large number of items is expected.

As a result, some items are not processed at all, depending on how many parallel chunks I configure.
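The skipping mechanism can be reproduced outside Spring Batch with a small, self-contained simulation (all names here are hypothetical, not from any real job): 20 READY items, a paging "reader" that advances its OFFSET after each chunk, and a "processor" that flips the state. Only the first page ever gets processed:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PagingSkipDemo {
    static final int CHUNK = 10;

    /** Runs the simulation and returns the ids that actually got processed. */
    static List<Integer> run() {
        String[] status = new String[20];          // items with ids 0..19
        Arrays.fill(status, "READY");

        List<Integer> processed = new ArrayList<>();
        int page = 0;
        while (true) {
            // Emulates: SELECT id WHERE status = 'READY'
            //           OFFSET page*CHUNK FETCH NEXT CHUNK ROWS ONLY
            List<Integer> selected = new ArrayList<>();
            int readyIndex = 0;
            for (int id = 0; id < status.length; id++) {
                if ("READY".equals(status[id])) {
                    if (readyIndex >= page * CHUNK && selected.size() < CHUNK) {
                        selected.add(id);
                    }
                    readyIndex++;
                }
            }
            if (selected.isEmpty()) break;
            for (int id : selected) status[id] = "PROCESSED"; // the processor flips the state
            processed.addAll(selected);
            page++;                                           // the paging reader advances its offset regardless
        }
        return processed;
    }

    public static void main(String[] args) {
        // Only ids 0-9 come back: once they turn PROCESSED, OFFSET 10
        // points past the remaining READY rows 10-19, which are skipped.
        System.out.println(run());
    }
}
```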

What is the standard solution to handle this case?

  • pre-select the ids and process them in the next step? Is it possible to do chunk processing on the result of a previous step?
  • how can I work with a fixed list of items throughout the entire step (selected just once)?

Note: the select above is simplified; the ItemReader actually uses a more complex select that picks the latest processable item per grouping column (ssn), provided there is no newer item for that ssn in SUCCESS state:

    SELECT bi.*, ri.*
    FROM ri_item ri
             JOIN item bi ON ri.item_id = bi.id
    WHERE (ri.ssn, bi.created_timestamp) IN
          (SELECT ri2.ssn, MAX(bi2.created_timestamp)
           FROM ri_item ri2
                    JOIN item bi2 ON ri2.item_id = bi2.id
           WHERE ri2.status IN ('READY', 'FAILED_RECOVERABLE')
             AND bi2.created_timestamp > COALESCE(
                   (SELECT MAX(bi3.created_timestamp)
                    FROM ri_item ri3
                             JOIN item bi3 ON ri3.item_id = bi3.id
                    WHERE ri3.status IN ('SUCCESS')
                      AND ri3.ssn = ri2.ssn),
                   TO_TIMESTAMP('1970-01-01', 'YYYY-MM-DD'))
           GROUP BY ri2.ssn)

Thank you

Edit: I gave this advice a try, but the result is the same. Obviously. Can anyone provide a simple example of how to use the process indicator pattern? Is this a case for it?
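For reference, a minimal sketch of how the process indicator pattern is commonly set up in Spring Batch Java config. All bean, table, column, and class names below are assumptions for illustration. The key point is that `JdbcCursorItemReader` opens one cursor for the whole step, so status updates made mid-step cannot shift subsequent reads the way an OFFSET-based paging reader does:

```java
// Sketch only: assumes a DataSource bean, a JdbcTemplate bean, and a
// simple Item class with id/status fields (all hypothetical names).
@Bean
public JdbcCursorItemReader<Item> readyItemReader(DataSource dataSource) {
    return new JdbcCursorItemReaderBuilder<Item>()
            .name("readyItemReader")
            .dataSource(dataSource)
            // The cursor is opened once at step start; rows flagged READY
            // at that moment stay in the result set even after their
            // status is updated mid-step.
            .sql("SELECT id, status FROM the_table WHERE status = 'READY'")
            .rowMapper((rs, rowNum) -> new Item(rs.getLong("id"), rs.getString("status")))
            .build();
}

@Bean
public ItemWriter<Item> markProcessedWriter(JdbcTemplate jdbcTemplate) {
    // The write acts as the process indicator: the flag is flipped only
    // after the item has been processed successfully.
    return items -> items.forEach(item -> jdbcTemplate.update(
            "UPDATE the_table SET status = 'PROCESSED' WHERE id = ?", item.getId()));
}
```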

Thank you very much!!


Solution

  • Well, the problem was that the processor in the chunk step was modifying the status. The next chunk's select then ran against a different set of data: select ... offset 10 limit 10

    The result was that half of the items to process were skipped during the job run.

    The solution was to use the "staging pattern": first select the ids to process into a separate table, then perform the chunk select against this fixed staging list.

    The staging table can be cleared in a cleanup step afterwards.
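The staging idea can be sketched without any framework (all names hypothetical): snapshot the ids to process exactly once, then page over that fixed list. Because the chunk "select" no longer depends on the mutable status column, every item is processed:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StagingDemo {
    static final int CHUNK = 10;

    /** Runs the simulation and returns the ids that got processed. */
    static List<Integer> run() {
        String[] status = new String[20];
        Arrays.fill(status, "READY");

        // "Staging step": select the ids to process exactly once.
        List<Integer> staged = new ArrayList<>();
        for (int id = 0; id < status.length; id++) {
            if ("READY".equals(status[id])) staged.add(id);
        }

        // "Chunk step": page over the fixed staging list. Flipping the
        // status no longer affects what the next page returns.
        List<Integer> processed = new ArrayList<>();
        for (int offset = 0; offset < staged.size(); offset += CHUNK) {
            List<Integer> chunk =
                    staged.subList(offset, Math.min(offset + CHUNK, staged.size()));
            for (int id : chunk) status[id] = "PROCESSED";
            processed.addAll(chunk);
        }
        return processed; // all 20 ids this time, none skipped
    }

    public static void main(String[] args) {
        System.out.println(run().size()); // prints 20
    }
}
```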