Tags: csv, awk, grep, solaris

Create a new file with the duplicate rows of a file


I have the command below which extracts rows that are duplicates based on their second column (file.csv is comma-separated):

awk -F',' '{print $2}' file.csv | sort | uniq -d | grep -F -f - file.csv >file.bad

But on Solaris this command doesn't work, because the -F option of grep is not supported.
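
A possible workaround, assuming the classic fgrep utility is available on the Solaris box (it usually lives in /usr/bin/fgrep), is to use fgrep instead of grep -F. Writing the duplicated keys to a temporary file (dup_keys.txt is only a placeholder name here) also avoids relying on -f - reading patterns from standard input, which not every grep implementation supports:

awk -F',' '{print $2}' file.csv | sort | uniq -d > dup_keys.txt   # duplicated 2nd-column values
fgrep -f dup_keys.txt file.csv > file.bad                         # keep lines containing those values
rm -f dup_keys.txt

Like the original pipeline, this matches each key anywhere in the line, not only in the second column; the awk-only solution below avoids that.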

Example of lines in the original file:

"A","TEST","Other","TEST",...  
"A","TEST2","Other A","TEST2",...  
"A","TEST","Other B","TEST",...

Example of lines in the new file:

"A","TEST","Other","TEST",...  
"A","TEST","Other B","TEST",...

Solution

  • cat sample.csv

    "A","TEST","Other","TEST",...
    "A","TEST2","Other A","TEST2",...
    "A","TEST","Other B","TEST",...
    

    Run the command below:

    awk -F, 'NR==FNR{a[$2]++;next} (a[$2]>1)' sample.csv sample.csv > new_file.csv
    

    content of new_file.csv

    "A","TEST","Other","TEST",...
    "A","TEST","Other B","TEST",...
    

    What this does is read sample.csv twice. On the first pass it stores the 2nd field in an array; on the second pass, if that 2nd field occurred more than once, the line is printed. (A single-pass alternative is sketched after the list below.)

    • NR==FNR{a[$2]++;next} — on the first pass (NR equals FNR only while the first file argument is being read), count the 2nd field in array a, then skip to the next record/line

    • (a[$2]>1) — on the second pass over sample.csv, check the array: if the 2nd field occurred more than once, the line is printed
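
    If reading the input twice is not an option (for example when it comes from a pipe), the same idea can be done in a single pass, at the cost of holding the whole file in memory. This is only a sketch of an alternative, not part of the original answer:

        awk -F, '
            { line[NR] = $0; key[NR] = $2; count[$2]++ }   # buffer each line and count its 2nd field
            END {                                          # after the last record, print the lines
                for (i = 1; i <= NR; i++)                  # whose 2nd field occurred more than once
                    if (count[key[i]] > 1) print line[i]
            }
        ' sample.csv > new_file.csv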