Tags: bash, shell, cut, uniq

Finding unique occurrences in a CSV based on a certain field in shell


I have a file emails.csv:

>cat emails.csv
1,joe,joe@gmail.com,32
2,jim,jim@hotmail.fr,23
3,steve,steve_smith@temporary.com.br,45
4,joseph,joseph@protonmail.com,23
5,jim,jim29@bluewin.ch,29
6,hilary,hilary@bluewin.ch,32

I want to keep only the first entry for each value of the last field (age), i.e. deduplicate on the last field. The output that I want is:

1,joe,joe@gmail.com,32
2,jim,jim@hotmail.fr,23
3,steve,steve_smith@temporary.com.br,45
5,jim,jim29@bluewin.ch,29

The following script does the filtering:

cut -d, -f4 emails.csv |
while read age1; do
    line=1; continue_loop=1
    cut -d, -f4 emails.csv | while read age; do
        if [[ $age1 == $age ]] && [[ $continue_loop == 1 ]]; then
            head -n $line emails.csv | tail -n 1
            continue_loop=0
        fi
        let line++
    done
done | sort -u

However, I am looking for a solution that doesn't require two nested loops, as this seems overcomplicated.


Solution

  • sort -t, -k4 emails.csv | sed -e 's/,/ /g' | uniq -f3 | sed -e 's/ /,/g'

    Note that this only works because no field in the sample data contains spaces (uniq -f skips whitespace-separated fields), and the output comes back sorted by age rather than in the original line order.
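An alternative that avoids both loops entirely is the standard awk deduplication idiom, which keeps the first line for each distinct fourth field and preserves the original order. A sketch using the sample data from the question:

```shell
# Recreate the sample file from the question
cat > emails.csv <<'EOF'
1,joe,joe@gmail.com,32
2,jim,jim@hotmail.fr,23
3,steve,steve_smith@temporary.com.br,45
4,joseph,joseph@protonmail.com,23
5,jim,jim29@bluewin.ch,29
6,hilary,hilary@bluewin.ch,32
EOF

# seen[$4]++ is 0 (false) the first time an age is encountered, so
# !seen[$4]++ is true exactly once per distinct 4th field; awk's default
# action prints the line.
awk -F, '!seen[$4]++' emails.csv
```

Against the sample file this prints lines 1, 2, 3 and 5, matching the desired output exactly, including the original order.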

    That said, a language like Perl or Python will let you write a more robust and less ugly solution.