I have a large file; a sample is shown below:
A222, 00000, 555
A222, 00100, 555
A222, 00200, 555
A222, 00300, 555
A222, 00000, 555
A222, 00100, 555
A222, 00000, 555
A222, 00200, 555
It's a sample of a file that has order headers (00000) and related order details (00100, 00200, etc.). I want to split the file into chunks of around 40000 lines each, such that each order header stays together with its order details.
I used GNU parallel to split the file into 40000-line chunks, but I am not able to satisfy the condition that an Order Header and its related order details all end up in the same file while keeping each file at around 40000 lines.
For the above sample file, if I had to split it into files of around 5 lines each, I would use:
parallel --pipe -N5 'cat > sample_{#}.txt' <sample.txt
But that would give me
sample_1.txt
A222, 00000, 555
A222, 00100, 555
A222, 00200, 555
A222, 00300, 555
A222, 00000, 555
sample_2.txt
A222, 00100, 555
A222, 00000, 555
A222, 00200, 555
That puts the 2nd Order Header in the first file and its related order details in the second one. The desired output would be:
sample_1.txt
A222, 00000, 555
A222, 00100, 555
A222, 00200, 555
A222, 00300, 555
sample_2.txt
A222, 00000, 555
A222, 00100, 555
A222, 00000, 555
A222, 00200, 555
When each Order Header has a lot of records, you might consider the simple
csplit -z sample.txt '/00000,/' '{*}'
This makes one file for each Order Header. It ignores the ~40K-line requirement and might produce a great many files, so it is only a viable solution when you have a limited number (perhaps 40?) of different Order Headers.
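For example, run on the eight-line sample, it produces one file per Order Header, named xx00, xx01, ... by default (csplit also prints each file's byte count):

$ csplit -z sample.txt '/00000,/' '{*}'
$ head xx*
==> xx00 <==
A222, 00000, 555
A222, 00100, 555
A222, 00200, 555
A222, 00300, 555

==> xx01 <==
A222, 00000, 555
A222, 00100, 555

==> xx02 <==
A222, 00000, 555
A222, 00200, 555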
When you do want different headers combined in the same file, consider:
awk -v max=40000 '
# flush(): write the buffered group of lines to the current output
# file, opening a new file first when the group would push it past
# "max" lines (or when no file has been opened yet).
function flush() {
    if (last + nr > max || sample == 0) {
        outfile = "sample_" sample++ ".txt"
        last = 0
    }
    for (i = 0; i < nr; i++) print a[i] >> outfile
    last += nr                # lines written to the current file
    nr = 0                    # reset the group buffer
}
BEGIN { sample = 0 }
/00000,/ { flush() }          # new Order Header: flush previous group
{ a[nr++] = $0 }              # buffer the line with its group
END { flush() }               # write the last group
' sample.txt
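As a sanity check, running the same script on the eight-line sample with -v max=5 instead of 40000 reproduces the desired split (note that the script numbers files from 0, and appends with ">>", so remove any old sample_*.txt before re-running):

$ head sample_*.txt
==> sample_0.txt <==
A222, 00000, 555
A222, 00100, 555
A222, 00200, 555
A222, 00300, 555

==> sample_1.txt <==
A222, 00000, 555
A222, 00100, 555
A222, 00000, 555
A222, 00200, 555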