Tags: bash, csv, awk, sed, lines

How to compare and concatenate matching lines in huge CSV files in bash?


I have the following .csv files (huge in size, hundreds of MB to GB; around 20 columns; not sorted; delimited by ","):

name1,address1,town1,zip1,....,category1
name2,address2,town2,zip2,....,category2
name3,address3,town3,zip3,....,category3_1
name3,address3,town3,zip3,....,category3_2
name3,address3,town3,zip3,....,category3_3
name4,address4,town4,zip4,....,category4_1
name4,address4,town4,zip4,....,category4_2
name4,address4,town4,zip4,....,category4_3
name4,address4,town4,zip4,....,category4_4
name5,address5,town5,zip5,....,category5

I need to concatenate lines into a single one if they are identical except for the category, and put all of those categories in the last column, delimited by ";", e.g.:

name1,address1,town1,zip1,....,category1
name2,address2,town2,zip2,....,category2
name3,address3,town3,zip3,....,category3_1;category3_2;category3_3
name4,address4,town4,zip4,....,category4_1;category4_2;category4_3;category4_4
name5,address5,town5,zip5,....,category5

I tried it with while read ...; do read ...; done < $file, but that only reads the file two lines at a time and doesn't compare every line against the next. I also tried saving the category information to an array and building a merged category column, but at some line the script just stops parsing the way I want. It would be great to do this in awk or sed, because read is REALLY slow on big files with a lot of columns, but if there's a better way to do it in another language I'll be okay with that. Thanks a LOT!


Solution

  • Two approaches:

    Shell + sed

    You could use sed for this (particular) case:

    ( Hmm, too quick! I was wrong; without the P;D loop a group can be split across two pattern spaces:

    $ sed -e ':a;N;s/^\(\([^,]\+,\)\{5\}\)\(.*\)*\n\1/\1\3;/;ta' file.csv
    

    )

    $ sed -e ':a;$!N;s/^\(\([^,]\+,\)\{5\}\)\(.*\)*\n\1/\1\3;/;ta;P;D;$!ba' file.csv
    

    To retrieve the number of columns minus 1, i.e. the number of separators:

    read -r line < file.csv          # first line of the file
    cols="${line//,}"                # same line with every "," removed
    cols=$(( ${#line} - ${#cols} ))  # number of commas = number of columns - 1
    
    sed -e "
        :a;
         $!N;
         s/^\(\([^,]\+,\)\{$cols\}\)\(.*\)*\n\1/\1\3;/;
         ta;
         P;
         D;
         $!ba
      " file.csv
    
    name1,address1,town1,zip1,....,category1
    name2,address2,town2,zip2,....,category2
    name3,address3,town3,zip3,....,category3_1;category3_2;category3_3
    name4,address4,town4,zip4,....,category4_1;category4_2;category4_3;category4_4
    name5,address5,town5,zip5,....,category5
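
    By the way, the separator count can also be obtained by keeping only the commas of the first line and counting them; a small equivalent alternative (it does fork head, tr and wc, unlike the parameter-expansion trick above):

    cols=$(head -n1 file.csv | tr -cd ',' | wc -c)   # count of "," in the first line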
    

    Last field only!

    There is a simpler way, based on the last comma only:

    sed -e ":;$!N;s/^\(.*,\)\([^,]*\)*\n\1/\1\2;/;t;P;D;$!b" file.csv
    

    (To wipe unwanted CRs at the end of lines:

    sed -e ':a;$!N;s/\o015//g;s/^\(.*,\)\([^,]*\)*\n\1/\1\2;/;ta;P;D;$!ba'
    

    )
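
    Whether the CRs are there at all can be checked quickly by dumping the end of the first line, for example:

    head -n1 file.csv | od -c | tail -n2   # a trailing \r \n means DOS line endings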

    Pure bash (no fork)

    This could be done in pure bash (it works in other POSIX shells too!), though it is probably better suited to small files:

    while IFS= read -r line; do
        if [ "${line%,*}" = "${last%,*}" ]; then
            # same record except for the last field: append the category
            last="$last;${line##*,}"
        else
            # new record: flush the previous one (skip the initial empty $last)
            [ -n "$last" ] && echo "$last"
            last="$line"
        fi
    done < file.csv
    echo "$last"
    
    name1,address1,town1,zip1,....,category1
    name2,address2,town2,zip2,....,category2
    name3,address3,town3,zip3,....,category3_1;category3_2;category3_3
    name4,address4,town4,zip4,....,category4_1;category4_2;category4_3;category4_4
    name5,address5,town5,zip5,....,category5
    

    Note: there is no need to know the number of columns here, because the comparison is based on everything before the last comma.
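
    Since the question also mentions awk, here is a minimal awk sketch of the same "last comma" idea; like the sed and pure bash versions it assumes the duplicate rows are adjacent and uses everything before the last comma as the key:

    awk -F, '
    {
        key = $0; sub(/,[^,]*$/, "", key)   # record without its last field
        if (key == prev) {
            line = line ";" $NF             # same record: append the category
        } else {
            if (NR > 1) print line          # new record: flush the previous one
            prev = key; line = $0
        }
    }
    END { print line }' file.csv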