Search code examples
regexawkfindcut

bash: grep row if fields are equal


I need to find all the rows which have the same content in all the columns, by examining all CSV files in the system. Example:

MYCOL;1;2;3;4
MYCOL2;2;3;4;5
MYCOL3;1;1;1;1
MYCOL4;;;;

In my example, I would need to grep for MYCOL3 and MYCOL4, as they both have the same field content for all columns, it doesn't matter if there is no content.

I thought about something like this:

find / -name *.csv | xargs awk -F "," '{col[$1,$2]++} END {for(i in col) print i, col[i]}'

But I'm missing the comparison between all the columns.


Solution

  • You could use grep command:

    $ grep -xE '[^;]*(;[^;]*)\1+' ip.txt
    MYCOL3;1;1;1;1
    MYCOL4;;;;
    
    • -x to match only whole line
    • [^;]* first field
    • (;[^;]*) capture ; followed by non ; characters (i.e. second field)
    • \1+ use the captured field to be repeated as many times as needed till end of the line

    If input has only ASCII characters, you could use LC_ALL=C grep <...> to get the results faster.

    If you have GNU grep, you can use -r option and --include= option instead of find+grep

    Also, use find <...> -exec grep <...> {} + instead of find + xargs


    Just did a sample speed check, this regex may be too bad to use with BRE/ERE. Use grep -P if available. Otherwise, use awk or perl.

    $ perl -0777 -ne 'print $_ x 1000000' ip.txt | shuf > f1
    $ du -h f1
    53M    f1
    
    $ time LC_ALL=C grep -xE '[^;]*(;[^;]*)\1+' f1 > t1
    real    0m44.815s
    
    $ time LC_ALL=C grep -xP '[^;]*(;[^;]*)\1+' f1 > t2
    real    0m0.507s
    
    $ time perl -ne 'print if /^[^;]*(;[^;]*)\1+$/' f1 > t3
    real    0m3.973s
    
    $ time LC_ALL=C awk -F ';' '{for (i=3; i<=NF; i++) if ($i != $2) next} 1' f1 > t4
    real    0m2.728s
    
    $ diff -sq t1 t2
    Files t1 and t2 are identical
    $ diff -sq t1 t3
    Files t1 and t3 are identical
    $ diff -sq t1 t4
    Files t1 and t4 are identical