
How to find unique values based on two columns using awk


Consider a test file (test.txt) that contains the following data:

1,2 
2,3 
2,1 
2,2 
3,1 
1,3 
2,5
4,1

I want to remove duplicates per pair: for the pairs (1,2) and (2,1), only one of them should be printed (first come, first printed). The expected output is

1,2
2,3
3,1
2,2
2,5
4,1

I have tried the command awk -F"," '!seen[$1,$2]++ && !seen[$2,$1]' test.txt. It prints

1,2
2,3
3,1
2,5
4,1

Why is the pair 2,2 not printed? Also, how can I get the expected output?


Solution

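  • As already stated, the problem comes from the way the logical expression is evaluated, short-circuiting included: for the pair 2,2 both tests address the same element, and the first test has already incremented it by the time the second one reads it.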

    To circumvent this, one option is to test and set a single element of the two-dimensional array seen:

    awk -F"," '!seen[($1<$2?$1:$2)+0, ($1>$2?$1:$2)+0]++' test.txt
    

    Basically, it uses the minimum and the maximum of the two values as the indexes, so a single test replaces the two, and it then increments that element.

    Note the +0, which forces conversion to a number. This has to be done because the fields can contain extra whitespace, including at the end of the line.
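
As a rough check of the above, here is a minimal sketch that feeds the sample data from the question (trailing spaces included) through the min/max command, and also runs the original attempt on a lone 2,2 record. The helper names sample_data and dedup_pairs are made up for this illustration and are not part of the original answer.

```shell
#!/bin/sh

# Hypothetical helper: emits the sample data from the question,
# keeping the trailing spaces found on the first six lines.
sample_data() {
    printf '1,2 \n2,3 \n2,1 \n2,2 \n3,1 \n1,3 \n2,5\n4,1\n'
}

# Hypothetical helper: keys each record on (min, max) so that 1,2 and 2,1
# share one array element. The +0 converts each field to a number, which
# drops any blanks around it.
dedup_pairs() {
    awk -F"," '!seen[($1<$2?$1:$2)+0, ($1>$2?$1:$2)+0]++'
}

result=$(sample_data | dedup_pairs)
printf '%s\n' "$result"

# The original attempt on a lone 2,2 record: the first test increments
# seen[2,2], so the second test reads 1 and the record is rejected.
original=$(printf '2,2\n' | awk -F"," '!seen[$1,$2]++ && !seen[$2,$1]')
printf 'original attempt printed: [%s]\n' "$original"
```

awk prints each surviving record unchanged and in input order, so this should print 1,2 then 2,3, 2,2, 3,1, 2,5 and 4,1 (with 2,2 ahead of 3,1, and the trailing spaces of the input lines preserved), followed by an empty pair of brackets for the original attempt.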