
What is the performance gain of Spring Batch?


I'm trying to understand the main point of Spring Batch. This is what I have understood so far:

Consider a Job that takes Employees from a DB, processes them, and then writes them back to the DB.

  • I set the ItemReader to operate on chunks of 20 Employees. So, the reader retrieves 20 Employees from the DB.
  • Then, the ItemProcessor operates on the 20 Employees, one at a time (so the Processor is called 20 times).
  • Finally, the Writer saves the 20 Employees to the DB, again one at a time.
  • The whole cycle is then repeated for the next 20 Employees.

If that's how it works, what's the gain? It was always explained to me that if you have to save 100 elements, it's better to save a list of 100 elements once than to save 100 elements one at a time.


Solution

  • Your understanding of how Spring Batch works, and of where its gains come from, is off, or at least not fully informed.

    If you need to process 1_000_000 rows, you don't want to do that in an all-or-nothing way. What if you process all of them, try to store them, and one fails? You have lost all your processing: a lot of time spent on items for nothing, and nothing has been stored.

    Spring Batch helps with this in various ways, for example by skipping the broken items or by allowing you to restart the job from where it left off.
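
    To make the restart idea concrete, here is a minimal plain-Java sketch. It does not use Spring Batch's API: `RestartSketch`, `run`, `failOn`, and the `committed` counter are hypothetical stand-ins for a step and for the progress that Spring Batch records in its job repository. The only point is that work is committed chunk by chunk, so a failure loses at most the current chunk and a second run can resume after the last committed item.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical model of a restartable, chunk-committed step (not Spring Batch's API).
    final class RestartSketch {
        // Stand-in for the target table: results written by committed chunks.
        final List<Integer> written = new ArrayList<>();
        // Stand-in for persisted restart metadata: index of the next unprocessed item.
        int committed = 0;

        // Processes items starting at 'committed', in chunks of chunkSize.
        // Throws if an item equals failOn (pass null to simulate a clean run).
        void run(List<Integer> items, int chunkSize, Integer failOn) {
            while (committed < items.size()) {
                int end = Math.min(committed + chunkSize, items.size());
                List<Integer> chunk = new ArrayList<>();
                for (int i = committed; i < end; i++) {
                    int item = items.get(i);
                    if (failOn != null && item == failOn) {
                        // The current chunk is discarded; earlier chunks stay committed.
                        throw new IllegalStateException("bad item " + item);
                    }
                    chunk.add(item * 10); // placeholder for real processing
                }
                written.addAll(chunk); // write the whole chunk
                committed = end;       // record progress only after a successful write
            }
        }

        public static void main(String[] args) {
            List<Integer> items = new ArrayList<>();
            for (int i = 1; i <= 10; i++) {
                items.add(i);
            }
            RestartSketch step = new RestartSketch();
            try {
                step.run(items, 3, 8); // item 8 sits in the third chunk (items 7..9)
            } catch (IllegalStateException e) {
                System.out.println("failed after committing " + step.committed + " items");
            }
            // "Restart" once the bad data is fixed: resumes at item 7, not at item 1.
            step.run(items, 3, null);
            System.out.println("written " + step.written.size() + " items");
        }
    }
    ```

    In this model the first run commits two chunks (six items) before failing, and the second run only redoes the chunk that was in flight.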

    How the ItemWriter works depends on the actual implementation and on the configuration of the underlying persistence mechanism. For JPA you can configure it to batch the insert/update statements, so they are sent to the database together as a batch instead of individually. For other mechanisms the whole list is written at once.

    NOTE: The ItemWriter is designed to write a whole chunk. See the ItemWriter javadoc; it is even made explicit in the write method, which takes a Chunk.
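
    The chunk-oriented flow can be sketched in plain Java as follows. This is a simplified model rather than Spring Batch's own code: `ChunkLoopSketch` and `runStep` are hypothetical names, and the standard `Iterator`, `Function`, and `Consumer` types only stand in for the roles of `ItemReader`, `ItemProcessor`, and `ItemWriter`. It shows items being read and processed one at a time, while the writer is called once per chunk with the whole list.

    ```java
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.function.Consumer;
    import java.util.function.Function;

    // Hypothetical, simplified model of a chunk-oriented step (not Spring Batch's API).
    final class ChunkLoopSketch {

        // Runs the read/process/write cycle and returns how often the writer was called.
        static <I, O> int runStep(Iterator<I> reader,
                                  Function<I, O> processor,
                                  Consumer<List<O>> writer,
                                  int chunkSize) {
            int writerCalls = 0;
            while (reader.hasNext()) {
                // Read: items are pulled one at a time until the chunk is full.
                List<I> inputs = new ArrayList<>(chunkSize);
                while (inputs.size() < chunkSize && reader.hasNext()) {
                    inputs.add(reader.next());
                }
                // Process: the processor is invoked once per item.
                List<O> outputs = new ArrayList<>(inputs.size());
                for (I input : inputs) {
                    outputs.add(processor.apply(input));
                }
                // Write: a single call receives the whole chunk.
                writer.accept(outputs);
                writerCalls++;
            }
            return writerCalls;
        }

        public static void main(String[] args) {
            List<String> employees = new ArrayList<>();
            for (int i = 1; i <= 45; i++) {
                employees.add("employee-" + i);
            }
            Function<String, String> processor = String::toUpperCase;
            Consumer<List<String>> writer =
                    chunk -> System.out.println("writing " + chunk.size() + " items in one call");
            int calls = ChunkLoopSketch.<String, String>runStep(
                    employees.iterator(), processor, writer, 20);
            System.out.println("writer calls: " + calls);
        }
    }
    ```

    With 45 items and a chunk size of 20, the writer in this model is called three times (20, 20, and 5 items), not 45 times.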

    In ordinary use it probably won't give you much of a performance gain, but it will give you insight into what you are processing. With large amounts of data it also brings real benefits for recovery, continuation, and restarts, which you will be thankful for.

    Finally, there are performance benefits once you start to split your workload over multiple servers and use Spring Batch to coordinate this. Instead of 1 machine doing everything, you can scale out and use 10 machines that each do part of the work.
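
    As a rough illustration of splitting the workload, the plain-Java sketch below divides an ID range into contiguous slices, one per worker. `PartitionSketch` and `split` are hypothetical names, and the assumption that the rows can be divided by a numeric ID range is mine, not something stated above. In a real setup, Spring Batch's partitioning support would hand each slice to a worker step, possibly on another machine.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper that slices a key range for parallel workers
    // (an illustration only, not Spring Batch's API).
    final class PartitionSketch {

        // Splits [minId, maxId] into at most gridSize contiguous, non-overlapping
        // ranges. Each range is a long[] {from, to} with both ends inclusive.
        static List<long[]> split(long minId, long maxId, int gridSize) {
            if (gridSize < 1 || maxId < minId) {
                throw new IllegalArgumentException("need gridSize >= 1 and maxId >= minId");
            }
            List<long[]> ranges = new ArrayList<>();
            long total = maxId - minId + 1;
            long size = (total + gridSize - 1) / gridSize; // ceiling division
            for (long from = minId; from <= maxId; from += size) {
                long to = Math.min(from + size - 1, maxId);
                ranges.add(new long[] {from, to});
            }
            return ranges;
        }

        public static void main(String[] args) {
            // 1_000_000 rows split over 10 workers: each gets 100_000 consecutive IDs.
            List<long[]> ranges = split(1, 1_000_000, 10);
            for (int i = 0; i < ranges.size(); i++) {
                long[] r = ranges.get(i);
                System.out.println("worker " + i + ": ids " + r[0] + ".." + r[1]);
            }
        }
    }
    ```

    Each worker can then run the same read/process/write step over its own slice, so a failure in one slice does not force the others to be redone.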