bash: grep row if fields are equal

I need to find all the rows which have the same content in all the columns, by examining all CSV files in the system. Example:

MYCOL;1;2;3;4
MYCOL2;2;3;4;5
MYCOL3;1;1;1;1
MYCOL4;;;;

In my example, I would need to grep for MYCOL3 and MYCOL4, as they both have the same field content for all columns, it doesn't matter if there is no content.

I thought about something like this:

find / -name *.csv | xargs awk -F "," '{col[$1,$2]++} END {for(i in col) print i, col[i]}'

But I'm missing the comparison between all the columns.

Solution

You could use grep command:

$ grep -xE '[^;]*(;[^;]*)\1+' ip.txt
MYCOL3;1;1;1;1
MYCOL4;;;;

-x to match only whole line
[^;]* first field
(;[^;]*) capture ; followed by non ; characters (i.e. second field)
\1+ use the captured field to be repeated as many times as needed till end of the line

If input has only ASCII characters, you could use LC_ALL=C grep <...> to get the results faster.

If you have GNU grep, you can use -r option and --include= option instead of find+grep

Also, use find <...> -exec grep <...> {} + instead of find + xargs

Just did a sample speed check, this regex may be too bad to use with BRE/ERE. Use grep -P if available. Otherwise, use awk or perl.

$ perl -0777 -ne 'print $_ x 1000000' ip.txt | shuf > f1
$ du -h f1
53M    f1

$ time LC_ALL=C grep -xE '[^;]*(;[^;]*)\1+' f1 > t1
real    0m44.815s

$ time LC_ALL=C grep -xP '[^;]*(;[^;]*)\1+' f1 > t2
real    0m0.507s

$ time perl -ne 'print if /^[^;]*(;[^;]*)\1+$/' f1 > t3
real    0m3.973s

$ time LC_ALL=C awk -F ';' '{for (i=3; i<=NF; i++) if ($i != $2) next} 1' f1 > t4
real    0m2.728s

$ diff -sq t1 t2
Files t1 and t2 are identical
$ diff -sq t1 t3
Files t1 and t3 are identical
$ diff -sq t1 t4
Files t1 and t4 are identical