I have the command below, which extracts rows that are duplicates based on their second column (file.csv is comma-separated):
awk -F',' '{print $2}' file.csv | sort | uniq -d | grep -F -f - file.csv >file.bad
But on Solaris this command doesn't work, because the -F option of the grep command is not supported.
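If your Solaris installation ships fgrep (the traditional fixed-string equivalent of grep -F) or the XPG4 grep, the pipeline may work with only the grep invocation swapped out. A sketch under those assumptions; /tmp/dupkeys.$$ is a hypothetical temporary file, used because Solaris grep variants may not accept "-" as a stand-in for stdin with -f:
# Assumes fgrep exists on this system; it matches fixed strings like grep -F.
# The duplicate keys go through a temp file instead of "-f -".
awk -F',' '{print $2}' file.csv | sort | uniq -d > /tmp/dupkeys.$$
fgrep -f /tmp/dupkeys.$$ file.csv > file.bad
rm -f /tmp/dupkeys.$$
An awk-only answer like the one below sidesteps the grep portability question entirely.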
Example of lines in the original file:
"A","TEST","Other","TEST",...
"A","TEST2","Other A","TEST2",...
"A","TEST","Other B","TEST",...
Example of lines in the new file:
"A","TEST","Other","TEST",...
"A","TEST","Other B","TEST",...
cat sample.csv
"A","TEST","Other","TEST",...
"A","TEST2","Other A","TEST2",...
"A","TEST","Other B","TEST",...
Run the command below:
awk -F, 'NR==FNR{a[$2]++;next} (a[$2]>1)' sample.csv sample.csv > new_file.csv
content of new_file.csv
"A","TEST","Other","TEST",...
"A","TEST","Other B","TEST",...
What this does is read sample.csv twice. The first time, it stores the 2nd field of each line in an array; the second time, it checks whether that line's 2nd field occurred more than once, and if so, prints the line.
NR==FNR{a[$2]++;next}
During the first pass (NR==FNR is true only while reading the first file argument), count each 2nd field in array a, then skip to the next record/line.
(a[$2]>1)
During the second pass over sample.csv, check the array to see whether the line's 2nd field occurred more than once; if it has duplicates, the line is printed (printing is awk's default action for a true pattern).
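If reading the input twice is a problem (for example, it arrives on a pipe), the same idea works in a single pass by buffering the lines. This is a minimal sketch, assuming the whole file fits in memory; it uses the same sample.csv and new_file.csv names as above:
awk -F, '{
    count[$2]++          # count occurrences of each 2nd field
    line[NR] = $0        # buffer every line
    key[NR]  = $2        # remember each line's 2nd field
} END {
    for (i = 1; i <= NR; i++)
        if (count[key[i]] > 1)   # keep only lines whose 2nd field repeats
            print line[i]
}' sample.csv > new_file.csv
The trade-off is memory: the two-pass version only stores the distinct 2nd fields, while this variant holds every line until END.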