
How to print, with awk, the rows whose value in a column is duplicated a specific number of times


Data:

id  name    city       language  area_code
01  Juan    Cali       ES        44
01  José    Cali       ES        44
01  Pedro   Cali       ES        44
02  Albert  Edinburgh            19
02  Mark               En        19
03  Raisa   Hellsinki  FI        22
03  Lisa    Hellsinki
04  Gian    Roma       IT        33
05  Loris   Sicilia
05  Vera    Sicilia              31

The file containing this data is in the following format:

01;Juan;Cali;ES;44
01;José;Cali;ES;44
01;Pedro;Cali;ES;44
02;Albert;Edinburgh;;19
02;Mark;;En;19
03;Raisa;Hellsinki;FI;22
03;Lisa;Hellsinki;;
04;Gian;Roma;IT;33
05;Loris;Sicilia;;
05;Vera;Sicilia;;31

In this data, the ids 02, 03 and 05 each appear exactly twice. Regardless of what the other fields contain, I need to select only the rows whose id appears exactly twice, so the expected result would be:

02;Albert;Edinburgh;;19
02;Mark;;En;19
03;Raisa;Hellsinki;FI;22
03;Lisa;Hellsinki;;
05;Loris;Sicilia;;
05;Vera;Sicilia;;31

So far I have only found a way to select rows whose id is duplicated any number of times:

awk -F';' -v OFS=';' 'a[$1]++{print $0}' data.file

But I haven't been able to figure out the way to obtain only those lines with the id duplicated twice...

Update: like U2, I still haven't found what I'm looking for, but I have a new awk command that I think is closer:

awk -F';' -v OFS=';' '{a[$1]++; if (a[$1] == 2) {print $0}}' data.file

It correctly leaves out the row with id 04, but it still includes a row with id 01, which is repeated three times rather than exactly two...


Solution

  • In 2 passes:

    $ awk -F';' 'NR==FNR{cnt[$1]++; next} cnt[$1]==2' file file
    02;Albert;Edinburgh;;19
    02;Mark;;En;19
    03;Raisa;Hellsinki;FI;22
    03;Lisa;Hellsinki;;
    05;Loris;Sicilia;;
    05;Vera;Sicilia;;31
    

  • Or in 1 pass, if your input is grouped by the first field as shown in your example (you can always sort it first if not):

    $ awk -F';' '
        $1 != prev { if (cnt == 2) print buf; prev=$1; buf=$0; cnt=1; next }
        { buf=buf ORS $0; cnt++ }
        END { if (cnt == 2) print buf }
    ' file
    02;Albert;Edinburgh;;19
    02;Mark;;En;19
    03;Raisa;Hellsinki;FI;22
    03;Lisa;Hellsinki;;
    05;Loris;Sicilia;;
    05;Vera;Sicilia;;31
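
If the input is not grouped, and sorting it or reading the file twice is inconvenient, one alternative is to buffer the lines in memory and print the matching ones from an END block. The following is a minimal sketch of that idea, not part of the answer above. It assumes the whole file fits in memory, and the names `n`, `cnt`, `line` and `key` are purely illustrative. The required count is passed in with `-v n=2`, so the same script could select ids repeated some other number of times.

```shell
# Sample input from the question, written to a temporary file.
data=$(mktemp)
cat > "$data" <<'EOF'
01;Juan;Cali;ES;44
01;José;Cali;ES;44
01;Pedro;Cali;ES;44
02;Albert;Edinburgh;;19
02;Mark;;En;19
03;Raisa;Hellsinki;FI;22
03;Lisa;Hellsinki;;
04;Gian;Roma;IT;33
05;Loris;Sicilia;;
05;Vera;Sicilia;;31
EOF

# n is the required number of occurrences (illustrative variable name).
result=$(awk -F';' -v n=2 '
    # Count each id and buffer every line together with its id.
    { cnt[$1]++; line[NR] = $0; key[NR] = $1 }
    END {
        # Replay the buffered lines in input order, keeping only
        # those whose id occurred exactly n times in the whole file.
        for (i = 1; i <= NR; i++)
            if (cnt[key[i]] == n) print line[i]
    }
' "$data")

printf '%s\n' "$result"
rm -f "$data"
```

Unlike the 1-pass version above, this does not depend on rows with the same id being adjacent. The cost is that every line is held in memory until the end of the input.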