linux awk sed split text-processing

Split a file based on header and footer lines


I have a text file structured like this:

[timestamp1] header with space
[timestamp2] data1 
[timestamp3] data2
[timestamp4] data3
[timestamp5] ..
[timestamp6] footer with space
[timestamp7] junk
[timestamp8] header with space
[timestamp9] data4
[timestamp10] data5
[timestamp11] ...
[timestamp12] footer with space
[timestamp13] junk
[timestamp14] header with space
[timestamp15] data6
[timestamp16] data7
[timestamp17] data8
[timestamp18] ..
[timestamp19] footer with space

I need to find each part between a header and a footer and save it in a separate file. For example, file1 should contain (with or without timestamps; it doesn't matter):

data1
data2
data3
..

and the next block should be saved as file2, and so on. This seems like a routine task, but I haven't found a solution yet.

I have this sed command that finds the first packet.

sed -n "/header/,/footer/{p;/footer/q}" file

But I don't know how to repeat that for the following matches. Maybe I should delete the first match after copying it to another file and run the same command again?


Solution

  • I would use GNU AWK for this task in the following way. Let the content of file.txt be

    [timestamp1] header with space
    [timestamp2] data1 
    [timestamp3] data2
    [timestamp4] data3
    [timestamp5] ..
    [timestamp6] footer with space
    [timestamp7] junk
    [timestamp8] header with space
    [timestamp9] data4
    [timestamp10] data5
    [timestamp11] ...
    [timestamp12] footer with space
    [timestamp13] junk
    [timestamp14] header with space
    [timestamp15] data6
    [timestamp16] data7
    [timestamp17] data8
    [timestamp18] ..
    [timestamp19] footer with space
    

    then

    awk '/header/{c+=1;p=1;next}/footer/{close("file" c);p=0}p{print $0 > ("file" c)}' file.txt
    

    produces file1 with content

    [timestamp2] data1 
    [timestamp3] data2
    [timestamp4] data3
    [timestamp5] ..
    

    and file2 with content

    [timestamp9] data4
    [timestamp10] data5
    [timestamp11] ...
    

    and file3 with content

    [timestamp15] data6
    [timestamp16] data7
    [timestamp17] data8
    [timestamp18] ..
    

    Explanation: the code has three pattern-action pairs. For a line containing header, I increase the counter c by 1, set the flag p to 1, and go to the next line, so no other action is taken for that line. For a line containing footer, I close the file named file followed by the current counter value and set the flag p to 0. For lines where p is true, I print the current line ($0) to the file named file followed by the current counter value. If required, adjust /header/ and /footer/ so that they match only the lines that really are header and footer lines.

    (tested in GNU Awk 5.0.1)
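
If a data or junk line can itself contain the word header or footer, the patterns may need tightening as suggested above. The following is only a sketch of one way to do that, not part of the original answer: it strips the leading [timestamp] (the question says timestamps are optional) and compares the rest of the line with the exact header and footer text. The sample input, the part output prefix, and the literal strings "header with space" and "footer with space" are assumptions for illustration; substitute whatever your real header and footer lines look like.

```shell
#!/bin/sh
# Sketch only: sample data, the "part" prefix, and the exact header/footer
# wording are assumptions made for this illustration.
set -eu

# Work in a scratch directory so the generated files do not clutter anything.
workdir=$(mktemp -d)
cd "$workdir"

cat > file.txt <<'EOF'
[timestamp1] header with space
[timestamp2] data1
[timestamp3] data2
[timestamp4] footer with space
[timestamp5] junk mentioning header
[timestamp6] header with space
[timestamp7] data3
[timestamp8] footer with space
EOF

awk '
  # Copy the record and remove the leading "[...] " timestamp from the copy.
  { line = $0; sub(/^\[[^]]*\] /, "", line) }

  # Header: the remaining text must equal the header exactly.
  # Start a new output file and skip the header line itself.
  line == "header with space" { c++; out = "part" c; p = 1; next }

  # Footer: close the finished file and stop printing.
  line == "footer with space" { if (p) close(out); p = 0; next }

  # Inside a block: write the line without its timestamp.
  p { print line > out }
' file.txt
```

With this sample, part1 should hold data1 and data2, and part2 should hold data3. The line "junk mentioning header" does not start a new file, which it would with the looser /header/ pattern.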