spring, spring-batch

load and process huge volume of data in spring batch


We have a use case to load 100M records from a shared object storage bucket into MongoDB, using the connection resource code below:

InputStream inputStream = null;
HttpURLConnection httpConnection = null;
try {
    httpConnection = (HttpURLConnection) this.url.openConnection();
    ResourceUtils.useCachesIfNecessary(httpConnection);
    if (StringUtils.hasText(byteRangeHeader)) {
        // request only the byte range assigned to this partition
        httpConnection.setRequestProperty("Range", String.format("bytes=%s", byteRangeHeader));
    }
    inputStream = httpConnection.getInputStream();
} catch (Exception e) {
    e.printStackTrace();
}
return inputStream;

We partition based on "Range" headers and load the 100M records with 15 threads (a sketch of this partitioning is shown after the options below); this takes around 30 minutes. The problem is that the HTTP connection is closed by networking devices after 15 minutes. How do we handle this scenario?

  1. Load the 100M records into memory and process them there (I can use multiple processes and threads)?
  2. The connection has to be refreshed before the 15-minute mark, but if a new connection is established, will the Spring Batch reader be able to continue from where it last left off?
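
For reference, the Range-based partitioning described above might look roughly like the sketch below. This is only an illustration of the idea, not the actual setup: the class name, the "byteRangeHeader" key, and the assumption that the total object size is known up front are all placeholders.

import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class ByteRangePartitioner implements Partitioner {

    private final long totalSizeInBytes; // size of the remote object, e.g. obtained from a HEAD request

    public ByteRangePartitioner(long totalSizeInBytes) {
        this.totalSizeInBytes = totalSizeInBytes;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        long sliceSize = totalSizeInBytes / gridSize;
        for (int i = 0; i < gridSize; i++) {
            long start = i * sliceSize;
            long end = (i == gridSize - 1) ? totalSizeInBytes - 1 : start + sliceSize - 1;
            ExecutionContext context = new ExecutionContext();
            // the step-scoped reader picks this value up and sends "Range: bytes=start-end"
            context.putString("byteRangeHeader", start + "-" + end);
            partitions.put("partition" + i, context);
        }
        return partitions;
    }
}

With gridSize set to 15, the partitioned step runs one worker per byte range, matching the 15-thread setup described above.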

Solution

  • You did not mention which reader you use, but if it inherits from AbstractItemCountingItemStreamItemReader, then it can be used in a restart scenario even when it reads from a remote resource (it will resume reading from the last offset saved in the metadata repository); see the reader sketch below.

    Another option, if you have enough local storage, is to download the file (or stage it in a database table/collection) in a first step and then process it in a subsequent step; see the two-step job sketch below.
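
To illustrate the first point, here is a minimal sketch of a reader built on AbstractItemCountingItemStreamItemReader, assuming line-delimited records; the class and field names are illustrative. The key point is that the superclass saves the current item count in the execution context on every chunk commit and restores it when the job is restarted.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

import org.springframework.batch.item.support.AbstractItemCountingItemStreamItemReader;

public class RemoteFileItemReader extends AbstractItemCountingItemStreamItemReader<String> {

    private final URL url;
    private BufferedReader reader;

    public RemoteFileItemReader(URL url) {
        this.url = url;
        setName("remoteFileItemReader"); // key under which the offset is stored in the execution context
        setSaveState(true);              // persist the current item count on every chunk commit
    }

    @Override
    protected void doOpen() throws Exception {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    }

    @Override
    protected String doRead() throws Exception {
        return reader.readLine(); // null signals the end of the data
    }

    @Override
    protected void doClose() throws Exception {
        if (reader != null) {
            reader.close();
        }
    }
}

On a restart, open() restores the saved item count and the default jumpToItem() re-reads and discards the already-processed items; overriding jumpToItem() to send a Range header for the remaining bytes instead would avoid re-downloading them.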
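
And here is a rough sketch of the staging approach from the second point, assuming Spring Batch 5-style builders; the bean names, the download tasklet, and the processStep are placeholders you would replace with your own beans.

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.job.builder.JobBuilder;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class StagedImportJobConfig {

    @Bean
    public Job stagedImportJob(JobRepository jobRepository, Step downloadStep, Step processStep) {
        return new JobBuilder("stagedImportJob", jobRepository)
                .start(downloadStep) // copy the remote object to local storage (or a staging collection)
                .next(processStep)   // chunk-oriented step: read the local copy and write to MongoDB
                .build();
    }

    @Bean
    public Step downloadStep(JobRepository jobRepository, PlatformTransactionManager transactionManager,
                             Tasklet downloadTasklet) {
        return new StepBuilder("downloadStep", jobRepository)
                .tasklet(downloadTasklet, transactionManager)
                .build();
    }
}

Once the data is local, the 15-minute connection limit no longer affects the processing step, which can then be partitioned or multi-threaded freely.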