I need to find all the rows which have the same content in all the columns, by examining all CSV files in the system. Example:
MYCOL;1;2;3;4
MYCOL2;2;3;4;5
MYCOL3;1;1;1;1
MYCOL4;;;;
In my example, I would need to grep for MYCOL3 and MYCOL4, as they both have the same field content for all columns, it doesn't matter if there is no content.
I thought about something like this:
find / -name *.csv | xargs awk -F "," '{col[$1,$2]++} END {for(i in col) print i, col[i]}'
But I'm missing the comparison between all the columns.
You could use grep
command:
$ grep -xE '[^;]*(;[^;]*)\1+' ip.txt
MYCOL3;1;1;1;1
MYCOL4;;;;
-x
to match only whole line[^;]*
first field(;[^;]*)
capture ;
followed by non ;
characters (i.e. second field)\1+
use the captured field to be repeated as many times as needed till end of the lineIf input has only ASCII characters, you could use LC_ALL=C grep <...>
to get the results faster.
If you have GNU grep
, you can use -r
option and --include=
option instead of find+grep
Also, use find <...> -exec grep <...> {} +
instead of find + xargs
Just did a sample speed check, this regex may be too bad to use with BRE/ERE. Use grep -P
if available. Otherwise, use awk
or perl
.
$ perl -0777 -ne 'print $_ x 1000000' ip.txt | shuf > f1
$ du -h f1
53M f1
$ time LC_ALL=C grep -xE '[^;]*(;[^;]*)\1+' f1 > t1
real 0m44.815s
$ time LC_ALL=C grep -xP '[^;]*(;[^;]*)\1+' f1 > t2
real 0m0.507s
$ time perl -ne 'print if /^[^;]*(;[^;]*)\1+$/' f1 > t3
real 0m3.973s
$ time LC_ALL=C awk -F ';' '{for (i=3; i<=NF; i++) if ($i != $2) next} 1' f1 > t4
real 0m2.728s
$ diff -sq t1 t2
Files t1 and t2 are identical
$ diff -sq t1 t3
Files t1 and t3 are identical
$ diff -sq t1 t4
Files t1 and t4 are identical