Tags: bash, csv, awk, sed, lines

How to compare and concatenate matching lines in huge CSV files in bash?


I have the following .csv files (huge in size, hundreds of MB to GB; around 20 columns; not sorted; delimited by ","):

name1,address1,town1,zip1,....,category1
name2,address2,town2,zip2,....,category2
name3,address3,town3,zip3,....,category3_1
name3,address3,town3,zip3,....,category3_2
name3,address3,town3,zip3,....,category3_3
name4,address4,town4,zip4,....,category4_1
name4,address4,town4,zip4,....,category4_2
name4,address4,town4,zip4,....,category4_3
name4,address4,town4,zip4,....,category4_4
name5,address5,town5,zip5,....,category5

I need to concatenate lines into a single one if they are identical except for the category, and put all of those categories in the last column, delimited by ";", e.g.:

name1,address1,town1,zip1,....,category1
name2,address2,town2,zip2,....,category2
name3,address3,town3,zip3,....,category3_1;category3_2;category3_3
name4,address4,town4,zip4,....,category4_1;category4_2;category4_3;category4_4
name5,address5,town5,zip5,....,category5

I tried it with while read ...; do read ...; done < $file, but that only reads the file two lines at a time and doesn't compare every line against the next. I also tried saving the category information to an array and building a merged category column, but at some line the script just stops parsing the way I want. It would be great to do this in awk or sed, because read is REALLY slow on big files with a lot of columns, but if there's a better way to do it in another language I'll be okay with that. Thanks a LOT!


Solution

  • Two approaches:

    Shell + sed

    You could use sed for this (particular) case:

    ( Hmm, too quick! I was wrong; without the P;D loop a group can be split across two pattern spaces:

    $ sed -e ':a;N;s/^\(\([^,]\+,\)\{5\}\)\(.*\)*\n\1/\1\3;/;ta' file.csv
    

    )

    $ sed -e ':a;$!N;s/^\(\([^,]\+,\)\{5\}\)\(.*\)*\n\1/\1\3;/;ta;P;D;$!ba' file.csv
    

    To retrieve the number of columns minus 1, i.e. the number of separators:

    read -r line < file.csv          # first line of the file
    cols="${line//,}"                # same line with every "," removed
    cols=$(( ${#line} - ${#cols} ))  # number of commas = number of columns - 1
    
    sed -e "
        :a;
         $!N;
         s/^\(\([^,]\+,\)\{$cols\}\)\(.*\)*\n\1/\1\3;/;
         ta;
         P;
         D;
         $!ba
      " file.csv
    
    name1,address1,town1,zip1,....,category1
    name2,address2,town2,zip2,....,category2
    name3,address3,town3,zip3,....,category3_1;category3_2;category3_3
    name4,address4,town4,zip4,....,category4_1;category4_2;category4_3;category4_4
    name5,address5,town5,zip5,....,category5
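
    By the way, the separator count can also be obtained by keeping only the commas of the first line and counting them; a small equivalent alternative (it does fork head, tr and wc, unlike the parameter-expansion trick above):

    cols=$(head -n1 file.csv | tr -cd ',' | wc -c)   # count of "," in the first line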
    

    Last field only!

    There is a simpler way, based on the last comma only:

    sed -e ":;$!N;s/^\(.*,\)\([^,]*\)*\n\1/\1\2;/;t;P;D;$!b" file.csv
    

    (To wipe unwanted CRs at the end of lines:

    sed -e ':a;$!N;s/\o015//g;s/^\(.*,\)\([^,]*\)*\n\1/\1\2;/;ta;P;D;$!ba'
    

    )
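
    Whether the CRs are there at all can be checked quickly by dumping the end of the first line, for example:

    head -n1 file.csv | od -c | tail -n2   # a trailing \r \n means DOS line endings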

    Pure bash (no fork)

    This could be done in pure bash (it works in other POSIX shells too!), though it is probably better suited to small files:

    while IFS= read -r line; do
        if [ "${line%,*}" = "${last%,*}" ]; then
            # same record except for the last field: append the category
            last="$last;${line##*,}"
        else
            # new record: flush the previous one (skip the initial empty $last)
            [ -n "$last" ] && echo "$last"
            last="$line"
        fi
    done < file.csv
    echo "$last"
    
    name1,address1,town1,zip1,....,category1
    name2,address2,town2,zip2,....,category2
    name3,address3,town3,zip3,....,category3_1;category3_2;category3_3
    name4,address4,town4,zip4,....,category4_1;category4_2;category4_3;category4_4
    name5,address5,town5,zip5,....,category5
    

    Note: there is no need to know the number of columns here, because the comparison is based on everything before the last comma.
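
    Since the question also mentions awk, here is a minimal awk sketch of the same "last comma" idea; like the sed and pure bash versions it assumes the duplicate rows are adjacent and uses everything before the last comma as the key:

    awk -F, '
    {
        key = $0; sub(/,[^,]*$/, "", key)   # record without its last field
        if (key == prev) {
            line = line ";" $NF             # same record: append the category
        } else {
            if (NR > 1) print line          # new record: flush the previous one
            prev = key; line = $0
        }
    }
    END { print line }' file.csv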