Search code examples
bashcsvawksed

Replace carriage return (^M which appear by type CTRL+V) with sed in bash


I'm trying to manage a file .xlsx in bash with a format of csv.

I do convert the excell the file (.xlsx) to .csv (CSV UTF-8(comma delimited)) to start managing the columns.

However, i notice in the excell file(.xlsx) the information contains in the cells of a column newlines(\n).

Could you help to solve this issue?

The way it look at vim the converted file from xlsx to csv

The information it's not shown properly due to a problems in the segmentation of columns

In the image before the command that I apply is: cat file | awk -F '|' '{print $2"|"$7"|"$27"|"$29"|"$30"|"$31}'

  • I do think in a solution by going into a use of sed to eliminate the \n with the following command: sed ':a;N;$!ba;s/\n/ /g' file How can I replace each newline (\n) with a space using sed? And then I could substitute the carriage return (^M) (e.g.| sed -e 's/^M/\n/g') to manage properly each column with awk(e.g. input | awk -F '|' '{print $2"|"$7"|"$27"|"$29"|"$30"|"$31}')

    However, the solution I think do not work because by applying the first sed command all the information stay in the same line and then I could not manage it with the awk command.

Have you thought of another solution?

The csv file after applying sed ':a;N;$!ba;s/\n/ /g'

My expectation is to be as shown here The info of each line be with \n to be able to manage with awk each column The result I expect is as shown here by applying the awk command


Solution

  • CSV export from XLS is usually LF alone inside fields and CR-LF at the end of records, i.e. it looks like:

    $ cat -A file
    "this","foo$
    ""bar,bar""$
    here","that"^M$
    "stuff","nonsense$
    here","too"^M$
    

    which is actually just 2 records but some fields contain LFs ($ above). CRs are displayed as ^M above.

    Given that format of input, to convert the CR-LFs at the end of lines to LFs and convert any LFs within quoted fields to blanks (just removing them would corrupt the remaining text), this would work using any awk:

    $ awk '
        BEGIN { FS=OFS="\"" }
        {
            $0 = prev sep $0
            prev = $0
            sep = ORS
        }
        NF%2 {
            sub(/\r$/,"")
            for ( i=2; i<NF; i+=2 ) {
                gsub(/\n/," ",$i)
            }
            print
            prev = sep = ""
        }
    ' file
    "this","foo ""bar,bar"" here","that"
    "stuff","nonsense here","too"
    

    Note that although the input contains CR-LF at the end of each record we can't rely on awk seeing that as the underlying platform's C primitives may strip those CRs before awk sees them. In GNU awk you can get around that problem by setting BINMODE=3 but that's not portable to other awks.

    See What's the most robust way to efficiently parse CSV using awk? for more information on parsing CSVs with awk.