Search code examples
awk

awk + How do I find duplicates in a column?


How do I find duplicates in a column?

$ head countries_lat_long_int_code3.csv | cat -n
     1  country,latitude,longitude,name,code
     2  AD,42.546245,1.601554,Andorra,376
     3  AE,23.424076,53.847818,United Arab Emirates,971
     4  AF,33.93911,67.709953,Afghanistan,93
     5  AG,17.060816,-61.796428,Antigua and Barbuda,1
     6  AI,18.220554,-63.068615,Anguilla,1
     7  AL,41.153332,20.168331,Albania,355
     8  AM,40.069099,45.038189,Armenia,374
     9  AN,12.226079,-69.060087,Netherlands Antilles,599
    10  AO,-11.202692,17.873887,Angola,244

For instance this has duplicates in the 5th column.

     5  AG,17.060816,-61.796428,Antigua and Barbuda,1
     6  AI,18.220554,-63.068615,Anguilla,1

How do I view all the others in this file?

I know I can do this:

awk -F, 'NR>1{print $5}' countries_lat_long_int_code3.csv | sort

And I can eyeball and see if there is any duplicates, but is there a better way?

Or I can do this: Find out how may are there completely

$ awk -F, 'NR>1{print $5}' countries_lat_long_int_code3.csv | sort | wc -l
210

Find out how many unique values are there

$ awk -F, 'NR>1{print $5}' countries_lat_long_int_code3.csv | sort | uniq | wc -l
183

Therefore there are at most 27 (210-183) duplicates.

EDIT1

My desired output would be something as follows, basically all the columns but just showing the rows that are duplicates:

 5  AG,17.060816,-61.796428,Antigua and Barbuda,1
 6  AI,18.220554,-63.068615,Anguilla,1

Solution

  • This will give you the duplicated codes

    awk -F, 'a[$5]++{print $5}'
    

    if you're only interested in count of duplicate codes

    awk -F, 'a[$5]++{count++} END{print count}'
    

    To print duplicated rows try this

    awk -F, '$5 in a{print a[$5]; print} {a[$5]=$0}'
    

    This will print the whole row with duplicates found in col $5:

    awk -F, 'a[$5]++{print $0}'
    

    Based on the comment by @JonathanLeffler below... If you want to print the duplicates only once. Use this alternative, also renamed the arrays to be self descriptive

    awk -F, 'count[$5]++ && !printed[$5]{print; printed[$5]++}'
    

    this keeps track of already printed keys and do not print them again.