bashcsvunixseddelimited

Converting a CSV file with double quotes to a pipe-delimited format using sed


I'm trying to convert CSV files into pipe-delimited text files.

Let's say I have a file called sample.csv:

aaa",bbb"ccc,"ddd,eee",fff,"ggg,hhh,iii","jjj kkk","lll"" mmm","nnn"ooo,ppp"qqq",rrr" sss,"ttt,""uuu",Z

I'd like to convert this into an output that looks like this:

aaa"|bbb"ccc|ddd,eee|fff|ggg,hhh,iii|jjj kkk|lll" mmm|"nnn"ooo|ppp"qqq"|rrr" sss|ttt,"uuu|Z

Now after tons of searching, I have come the closest using this sed command:

sed -r 's/""/\v/g;s/("([^"]+)")?,/\2\|/g;s/"([^"]+)"$/\1/;s/\v/"/g'

However, the output that I received was:

aaa"|bbb"ccc|ddd,eee|fff|ggg,hhh,iii|jjj kkk|lll" mmm|"nnn"ooo|pppqqq|rrr" sss|ttt,"uuu|Z

Where the expected for the 9th column should have been ppp"qqq" but the result removed the double quotes and what I got was pppqqq.

I have been playing around with this for a while, but to no avail. Any help regarding this would be highly appreciated.


Solution

  • The problem with sample.csv is that it mixes non-quoted fields (containing quotes) with fully quoted fields (that should be treated as such).

    You can't have both at the same time. Either all fields are (treated as) unquoted and quotes are preserved, or all fields containing a quote (or separator) are fully quoted and the quotes inside are escaped with another quote.

    So, sample.csv should become:

    "aaa""","bbb""ccc","ddd,eee",fff,"ggg,hhh,iii","jjj kkk","lll"" mmm","""nnn""ooo","ppp""qqq""","rrr"" sss","ttt,""uuu",Z
    

    to give you the desired result (using a csv parser):

    aaa"|bbb"ccc|ddd,eee|fff|ggg,hhh,iii|jjj kkk|lll" mmm|"nnn"ooo|ppp"qqq"|rrr" sss|ttt,"uuu|Z