Search code examples
bashdarwin

Is there an easy way to remove duplicate fields in a CSV?


I have a CSV that has dates and names, I want to sort out duplicate dates, but leave all the names by using bash tools. Eg:

Input CSV:

01/01/1960,"Torvalds, Linus"
01/01/1960,"Gates, Bill"
04/20/1960,"Woz, Steve"

Output CSV:

01/01/1960,"Torvalds, Linus"
,"Gates, Bill"
04/20/1960,"Woz, Steve"

I know I could do this using loops and line numbers, but I'm wondering if there's a better way. Perhaps using sort or uniq?


Solution

  • Using sort and awk :

    sort -t',' -k1 input.csv |awk -F, -v OFS=, '($1 in a) {gsub($1,"") } !($1 in a) {a[$1]=$1}1'
    01/01/1960,"Gates, Baleee"
    ,"Gates, Benny"
    ,"Gates, Bill"
    ,"Torvalds, Linus"
    04/20/1960,"Wozniac, pitor"
    ,"Woz, Steve"
    

    When the input.csv is as follow:

     -->cat input.csv
    01/01/1960,"Torvalds, Linus"
    01/01/1960,"Gates, Bill"
    01/01/1960,"Gates, Benny"
    04/20/1960,"Woz, Steve"
    01/01/1960,"Gates, Baleee"
    04/20/1960,"Wozniac, pitor"
    

    Explanation: Here, FS is set to comma and for each line first column is stored in an array named a if its not already present and if its already present then replace the first column with blank using gsub function. sort is used to bring dates in sequence.