Tags: linux, bash, shell, gnu-parallel

Split file into several files based on condition and also number of lines approximately


I have a large file; a sample is shown below:

A222, 00000, 555
A222, 00100, 555
A222, 00200, 555
A222, 00300, 555
A222, 00000, 555
A222, 00100, 555
A222, 00000, 555
A222, 00200, 555

This sample has order headers (00000) and their related order details (00100, 00200, etc.). I want to split the file into pieces of around 40000 lines each, such that every order header stays in the same file as its order details.

I used GNU parallel to split the file into chunks of 40000 lines, but I cannot get the split to also satisfy the condition that an order header and its related order details end up together in the same file, while each file still has around 40000 lines.

For the sample file above, if I wanted files of around 5 lines each, I would use:

parallel --pipe -N5 'cat > sample_{#}.txt' <sample.txt

But that would give me

sample_1.txt
A222, 00000, 555
A222, 00100, 555
A222, 00200, 555
A222, 00300, 555
A222, 00000, 555

sample_2.txt
A222, 00100, 555
A222, 00000, 555
A222, 00200, 555

This puts the second order header in the first file and its related order details in the second one.

The desired output is

sample_1.txt
A222, 00000, 555
A222, 00100, 555
A222, 00200, 555
A222, 00300, 555

sample_2.txt
A222, 00000, 555
A222, 00100, 555
A222, 00000, 555
A222, 00200, 555

Solution

  • When each order header has many records, you might consider a simple csplit:

    csplit -z sample.txt '/00000,/' '{*}'
    

    This creates one file per order header. It ignores the ~40K-line requirement and may produce a very large number of files, so it is only a viable solution when the number of order headers is limited (perhaps around 40).

    When you do want several orders combined in one file, consider

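    awk -v max=40000 '
       # Write the buffered order (header plus details) to the current output
       # file, starting a new file first if the order would push it past max.
       function flush() {
          if (sample==0 || (last>0 && last+nr>max)) {
             if (outfile!="") close(outfile);   # avoid piling up open files
             outfile="sample_" sample++ ".txt";
             last=0;
          }
          # ">" truncates a stale file on first open, then appends in this run
          for (i=0;i<nr;i++) print a[i] > outfile;
          last+=nr;
          nr=0;
       }
       BEGIN { sample=0 }
       /00000,/ { flush(); }   # new order header: emit the previous order
       {a[nr++]=$0}            # buffer every line of the current order
       END { flush() }         # emit the final order
       ' sample.txt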
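
To try this on the eight-line sample from the question, a sketch along the following lines could be used. It applies the same flush-on-header awk approach with max lowered to 5, then reports the size of each output file and whether it begins with an order header. The scratch directory created with mktemp and the check loop at the end are assumptions added for illustration, not part of the original answer.

```shell
#!/bin/sh
# Work in a scratch directory so existing files are not touched (assumption).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# The eight-line sample from the question.
cat > sample.txt <<'EOF'
A222, 00000, 555
A222, 00100, 555
A222, 00200, 555
A222, 00300, 555
A222, 00000, 555
A222, 00100, 555
A222, 00000, 555
A222, 00200, 555
EOF

# Buffer each order and flush it when the next header arrives;
# max is lowered to 5 to suit the small sample.
awk -v max=5 '
   function flush() {
      if (last+nr>max || sample==0) {
         outfile="sample_" sample++ ".txt";
         last=0;
      }
      for (i=0;i<nr;i++) print a[i] >> outfile;
      last+=nr;
      nr=0;
   }
   /00000,/ { flush(); }
   {a[nr++]=$0}
   END { flush() }
   ' sample.txt

# Illustrative check: line count per file, and whether line 1 is a header.
for f in sample_*.txt; do
    lines=$(awk 'END { print NR }' "$f")
    if head -n 1 "$f" | grep -q ', 00000,'; then
        echo "$f: $lines lines, starts with an order header"
    else
        echo "$f: $lines lines, starts mid-order"
    fi
done
```

With this sample, the first order (4 lines) goes to sample_0.txt, and the remaining two orders (2 lines each) share sample_1.txt, because adding the second order to the first file would exceed 5 lines. Note that max acts as an upper bound per file only when no single order is longer than max; an oversized order is still written whole.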