spring-boot, spring-batch, batch-processing, spring-batch-tasklet

How to disable/avoid linesToSkip(1) from the second file onwards in Spring Batch while processing a large CSV file


We have a large CSV file with 100 million records, and we use Spring Batch to load, read, and write it to the database, splitting the file into chunks of 1 million records with a SystemCommandTasklet. Below is a snippet:

@Bean
@StepScope
public SystemCommandTasklet splitFileTasklet(@Value("#{jobParameters[filePath]}") final String inputFilePath) {
    SystemCommandTasklet tasklet = new SystemCommandTasklet();

    final File file = BatchUtilities.prefixFile(inputFilePath, AppConstants.PROCESSING_PREFIX);

    final String command = configProperties.getBatch().getDataLoadPrep().getSplitCommand()
            + " " + file.getAbsolutePath()
            + " " + configProperties.getBatch().getDataLoad().getInputLocation()
            + System.currentTimeMillis() / 1000;
    tasklet.setCommand(command);
    tasklet.setTimeout(configProperties.getBatch().getDataLoadPrep().getSplitCommandTimeout());

    executionContext.put(AppConstants.FILE_PATH_PARAM, file.getPath());

    return tasklet;
}

and batch-config:

batch:
  data-load-prep:
    input-location: /mnt/mlr/prep/
    split-command: split -l 1000000 --additional-suffix=.csv       
    split-command-timeout: 900000 # 15 min
    schedule: "*/60 * * * * *"
    lock-at-most: 5m

With the above config, I am able to read, load, and write to the database successfully. However, I found a bug with the snippet below: after splitting the file, only the first file has headers; the subsequent split files do not have headers in the first line. So I have to either disable or avoid the linesToSkip(1) config for the FlatFileItemReader (CSVReader).

@Configuration
public class DataLoadReader {

    @Bean
    @StepScope
    public FlatFileItemReader<DemographicData> demographicDataCSVReader(@Value("#{jobExecutionContext[filePath]}") final String filePath) {
        return new FlatFileItemReaderBuilder<DemographicData>()
                .name("data-load-csv-reader")
                .resource(new FileSystemResource(filePath))
                .linesToSkip(1) // Need to avoid this from the 2nd split file onwards, as those files do not have headers
                .lineMapper(lineMapper())
                .build();
    }

    public LineMapper<DemographicData> lineMapper() {
        DefaultLineMapper<DemographicData> defaultLineMapper = new DefaultLineMapper<>();
        DelimitedLineTokenizer lineTokenizer = new DelimitedLineTokenizer();

        lineTokenizer.setNames("id", "mdl65DecileNum", "mdl66DecileNum", "hhId", "dob", "firstName", "middleName",
                "lastName", "addressLine1", "addressLine2", "cityName", "stdCode", "zipCode", "zipp4Code", "fipsCntyCd",
                "fipsStCd", "langName", "regionName", "fipsCntyName", "estimatedIncome");

        defaultLineMapper.setLineTokenizer(lineTokenizer);
        defaultLineMapper.setFieldSetMapper(new DemographicDataFieldSetMapper());
        return defaultLineMapper;
    }
}

Note: The loader should not skip the first row of the second file onwards while loading.

Thank you in advance. Appreciate any suggestions.


Solution

  • I would do it in the SystemCommandTasklet with the following command:

    tail -n +2 data.csv | split -l 1000000 --additional-suffix=.csv
    

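One caveat worth noting: SystemCommandTasklet runs the command via the operating system's exec mechanism, which does not interpret shell operators such as `|`, so a pipeline like the one above must be wrapped in an explicit shell invocation. The sketch below (class and method names are illustrative, and it assumes Spring Batch 5's `setCommand(String...)` signature; the paths are examples, not from the question) shows one way to build that wrapped command:

```java
// Sketch: SystemCommandTasklet executes via Runtime/ProcessBuilder semantics,
// which do not interpret shell operators like '|'. Wrapping the pipeline in
// "/bin/sh -c" makes the pipe work. Class and paths are illustrative.
public class SplitCommandBuilder {

    static String[] buildSplitCommand(String inputFile, String outputPrefix) {
        // 'split' reads from stdin when given '-' as its input argument;
        // outputPrefix controls where the chunk files are written.
        String pipeline = "tail -n +2 " + inputFile
                + " | split -l 1000000 --additional-suffix=.csv - " + outputPrefix;
        return new String[] {"/bin/sh", "-c", pipeline};
    }

    public static void main(String[] args) {
        String[] cmd = buildSplitCommand("data.csv", "/mnt/mlr/prep/part-");
        // In the tasklet bean this would be passed as: tasklet.setCommand(cmd);
        System.out.println(String.join(" ", cmd));
    }
}
```

On Spring Batch 4.x, where `setCommand` takes a single String that is tokenized on whitespace, this pipeline cannot be expressed directly and the shell-wrapping approach above is the usual workaround.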
    If you really want to do it in Java in your Spring Batch job, you can use a custom reader or an item processor that filters the header. But I would not recommend this approach, as it introduces an additional test for each item (given the large number of lines in your input file, this could impact the performance of your job).
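    For completeness, a minimal sketch of that filtering idea (the class and method names here are illustrative, not part of any API): in Spring Batch, an ItemProcessor that returns null drops the item, so a processor can discard the header row by recognizing its known first-column value. The check itself is a plain method, shown with the "id" column name from the question's tokenizer:

```java
// Sketch of a header-detection check that an ItemProcessor could delegate to.
// In Spring Batch, a processor returning null filters the item out of the chunk.
// Class and method names are illustrative.
public class HeaderFilter {

    // The header line of the question's file begins with the literal column name "id",
    // whereas data rows carry an actual identifier value in that field.
    static boolean isHeaderRecord(String firstField) {
        return "id".equals(firstField);
    }

    public static void main(String[] args) {
        System.out.println(isHeaderRecord("id"));    // header row -> would be filtered
        System.out.println(isHeaderRecord("12345")); // data row   -> would be kept
    }
}
```

    As noted above, though, this runs once per item across 100 million records, which is why pre-stripping the header with `tail` before splitting is the cleaner option.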