Search code examples
arraysbashcsvnewlinecut

Save elements of CSV with new lines to array


for a project I have to get data from a csv-file to a websystem via rest (curl) using xml format. I got almost everything to work but a thing.

The file I received for this project is a .xlsx-file. I exported it to a .csv-file (delimiter separated) and converted the encoding to UTF-8. The original file has about 2000 rows and 30 columns.

My strategy is to "cut" the fields of the .csv-file and save the output to an array with a bash (4.3.42) script (I haven't use Python before ;) ) on SLES12 SP2 (it's a test machine).

The csv is in the following format (simplified): (Note: the description which has three rows in one cell. And that is the part that breaks my script.) The curl command creates false entries with these new-lines.

Simplified CSV

About a fourth of the data has descriptions like that.

After the export to csv, this is what the file looks like if I open it with vim:

title|description|firstname
Test|Lorem ipsum dolor sit amet,
consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat
sed diam voluptua.|Chris

This is my simplified script:

file="Kontakte.csv"

# Get the values from the file
arrV[1]=$(cut -d'|' -f1 "$file" | cut -d$'\n' -f2) # Title
arrV[2]=$(cut -d'|' -f2 "$file" | cut -d$'\n' -f2) # Description
arrV[3]=$(cut -d'|' -f3 "$file" | cut -d$'\n' -f2) # First name

echo "### Values ###"
# For reference
echo "Title: " ${arrV[1]}
echo "Description: " ${arrV[2]}
echo "Name: " ${arrV[3]}

What I get is ...

### Values ###
Title:  Test
Description:  Lorem ipsum dolor sit amet,
Name: Chris

... instead of something like that:

### Values ###
Title:  Test
Description:  Lorem ipsum dolor sit amet,\n onsetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat\n sed diam voluptua.
Name: Chris

I have a vba script to get rid of the new-lines but would be a less-than-ideal solution.

It would be nice to just put some quotes around the executions to make it work, like:

arrV[2]="$(cut -d'|' -f2 "$file" | cut -d$'\n' -f2)"

But I coudn't solve the issue like that.

Do you have an idea, what to do about these new-lines? Or is getting rid of the new-lines beforehand in Excel my only option?

Thank you in advance.


Solution

  • You are using line-oriented tools, so it's not surprising that embedded \n causes trouble. Pick a character like ~, then use grep to verify it does not appear in your .csv. Export to .csv with \r (Mac) line terminators. Then use tr & mac2unix to map newlines to your reserved character and to repair the line terminators:

    tr '\n' '~' < Kontakte.csv | mac2unix > New.csv
    

    When you're finished munging the file, tr '~' '\n' will restore matters.

    Of course, it would be much better to use an appropriate tool, such as a CSV library.