Tags: bash, csv, awk

How to remove duplicate comma-separated strings using awk


I have a CSV file (named test2.csv) that looks like this:

lastname,firstname,83494989,1997-05-20,2015-05-07 15:30:43,Sentence Skills 104,Sentence Skills 104,Elementary Algebra 38,Elementary Algebra 38,Sentence Skills 104,Sentence Skills 104,Elementary Algebra 38,Elementary Algebra 38,

I want to remove the duplicate entries.

The closest I have got is the following awk command:

awk '{a[$0]++} END {for (i in a) print RS i}' RS="," test2.csv

It works, but it causes new problems: it takes the values out of order and puts them on separate rows like this:

,Elementary Algebra 38
,2015-05-07 15:30:43
,Sentence Skills 104
,firstname
,lastname
,1997-05-20
,83494989

I need to keep the values in their original order and on one line. (I can fix the row issue, but I don't know how to fix the order issue.)
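
From what I can tell, the order gets scrambled because awk's for (i in a) loop visits array indices in an unspecified, implementation-dependent order rather than insertion order. A quick demo with a throwaway three-value input:

printf 'b,a,c' | awk 'BEGIN{RS=","} {a[$0]++} END {for (i in a) print i}'
# b, a and c may come out in any order, depending on the awk implementation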

Update with Solution:

The answer from anubhava worked great. I also asked about removing the time from the date, and Ed Morton helped out with that. Here is the full command:

awk 'BEGIN{RS=ORS=","} {sub(/ ..:..:..$/,"")} !seen[$0]++' test2.csv
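
For the record, here is the same one-liner spread across several lines with comments (awk allows this inside the single quotes), which makes it easier to see what each piece does:

awk 'BEGIN { RS = ORS = "," }   # read and write comma-separated records instead of lines
     { sub(/ ..:..:..$/, "") }  # strip a trailing " HH:MM:SS" time from each value
     !seen[$0]++                # print a value only the first time it is seen
    ' test2.csv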

Solution

  • You can just use this awk:

    awk 'BEGIN{RS=ORS=","} !seen[$0]++' test2.csv

    This produces:

    lastname,firstname,83494989,1997-05-20,2015-05-07 15:30:43,Sentence Skills 104,Elementary Algebra 38,
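
    The !seen[$0]++ part is what keeps the order: seen[$0] is 0 (false) the first time a value appears, so the negation is true and the record is printed immediately, in input order; on later occurrences the counter is non-zero and the record is skipped. Written out more explicitly, the same logic looks like this (an equivalent sketch, not a different command):

    awk 'BEGIN{RS=ORS=","} { if (!($0 in seen)) { seen[$0] = 1; print } }' test2.csv

    Because records are printed as they are read, there is no END loop and therefore no dependence on awk's array traversal order.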