Search code examples
awktextcolumnsorting

How to merge only a specific field of different rows with awk


data.file

Istanbul;J;TK;13;OK
London;C;EN;28;OK
London;K;EN;32;OK
Paris;A;FR;30;OK
Paris;B;FR;40;OK
Zurich;G;DE;99;OK
Zurich;H;DE;33;OK
Zurich;G;DE;82;OK

Expected output:

Istanbul;J;TK;13;OK
London;C-K;EN;28-32;OK
Paris;A-B;FR;30-40;OK
Zurich;G-H;DE;33-82-99;OK

First field of every row is the condition, if this field is duplicated then merge fields 2 and 4, in field 5 just use the first occurrence.

Update: another condition is that fields 2 and 4 have to be sorted and repeated data deleted, as in the case of field 2 of Zurich.

The code I have so far is that in fields 2 and 4 data has to be sorted and repetitions deleted, as for Zurich...

awk -F';' -v OFS=';' '{getline nx; j=split (nx, Ax); for (i=1;i<=j;i++) $i=$i Ax[i]}1' data.file

Which obviously is not working as expected, this is what awfully gives back....

ParisParis;AB;FRFR;3040;OKOK
LondonLondon;CK;ENEN;2832;OKOK
IstanbulZurich;JZ;TKDE;1382;OKOK
ZurichZurich;GH;DEDE;9933;OKOK

Solution

  • With GNU awk for sorted_in:

    $ cat tst.awk
    BEGIN { FS=OFS=";" }
    $1 != prev {
        if ( NR>1 ) {
            prt()
        }
        prev = $1
        delete vals
    }
    {
        for ( fldNr=1; fldNr<=NF; fldNr++ ) {
            vals[fldNr][$fldNr]
        }
    }
    END { prt() }
    
    function prt(           fldNr,val,sep) {
        for ( fldNr=1; fldNr<=NF; fldNr++ ) {
            PROCINFO["sorted_in"] = "@ind_" (fldNr==4 ? "num" : "str") "_asc"
            sep = ""
            for ( val in vals[fldNr] ) {
                printf "%s%s", sep, val
                sep = "-"
            }
            printf "%s", (fldNr<NF ? OFS : ORS)
        }
    }
    

    $ awk -f tst.awk data.file
    Istanbul;J;TK;13;OK
    London;C-K;EN;28-32;OK
    Paris;A-B;FR;30-40;OK
    Zurich;G-H;DE;33-82-99;OK