Tags: datastage, large-data

File Splitting with DataStage (8.5)


I have a job that successfully produces a sequential file (CSV) output with some hundred million rows. Can someone provide an example where that output is instead written to a hundred separate sequential files, each with a million rows?

What would the Sequential File stage look like, and how would it be configured?

This is ultimately to allow QA to review any one of the individual outputs without needing a special text editor capable of opening very large files.


Solution

  • Based on the suggestion from @Mr. Llama, and in the absence of other forthcoming solutions, we decided on a simple script to be executed at the end of the scheduled DataStage event (an invocation sketch follows below).

    #!/bin/bash
    # usage:
    # sh ./[script] [input]
    
    # check for exactly one argument (the input CSV):
    if [ "$#" -ne 1 ]; then
      echo "No input file provided."
      exit 1
    fi
    
    # directory for the split output files:
    mkdir -p split
    
    # header row only:
    head -n 1 "$1" > header.csv
    
    # content without the header row:
    tail -n +2 "$1" > content.csv
    
    # split the content into 100000-record files
    # (named split/data_aa, split/data_ab, ...):
    split -l 100000 content.csv split/data_
    
    # prepend the header to each split file and add
    # a '.csv' extension:
    for f in split/*; do
      cat header.csv "$f" > "$f.csv"
      rm "$f"
    done
    
    # remove the temporary files:
    rm header.csv content.csv
    

    Crude, but it works for us in this case.
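
    For reference, a minimal sketch of how the script can be invoked; the script name (split_csv.sh) and the paths are hypothetical placeholders. One way to run it at the end of the scheduled event is to set the command as the job's after-job subroutine using DataStage's built-in ExecSH routine.

    # hypothetical manual invocation; substitute your script
    # location and the CSV file produced by the job:
    sh /scripts/split_csv.sh /data/out/full_output.csv
    
    # quick sanity check: each split file should contain the header
    # plus up to 100000 data rows, and the per-file data rows should
    # sum to the original file's row count minus its header:
    wc -l split/*.csv

    The 100000 passed to split -l is arbitrary; adjust it to whatever file size QA's text editors handle comfortably.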