Cloud Data Fusion problems reading a CSV export with the HTTP source


I am trying Cloud Data Fusion for the first time. I have this endpoint that I'd like to consume as a test:

https://waidlife.com/backend/export/index/export.csv?feedID=1&hash=4ebfa063359a73c356913df45b3fbe7f (This is a shopware export)

The header row shows the following structure:

id,title,description,link,image_link,price,availability,condition,google_product_category

When configuring the HTTP source (a plugin available in the Data Fusion Hub), I set up output schema fields for these columns. Please note that I set google_product_category to be nullable.

I also configured it to use CSV as the format and to skip the header row.

If you look at the data returned by the endpoint URL above, you will see that the column google_product_category is empty. I assumed this wouldn't be a problem, because the field is nullable in the output schema and Data Fusion could simply treat the empty value as null. Instead, the pipeline fails with the following error:

2021-02-25 19:38:37,192 - ERROR [Executor task launch worker for task 0:o.a.s.u.Utils@91] - Aborting task
java.lang.RuntimeException: Cannot convert line '"10042","NeoShell Reliance Jacket","Das Filson NeoShell Reliance Jacket besteht aus Polartec  NeoShell  der aktuell atmungsaktivsten und wasserdichtesten Membrane die es gibt. Im Gegensatz zu gewöhnlichem Shell-Material, ist NeoShell  besonders soft und geräuscharm und eignet sich somit auch perfekt für die Jagd. Die Nähte der wasserdichten Reißverschlüsse sind vollständig versiegelt. Die Reißverschlüsse unter den Achseln verhindern, dass sich bei hoher Aktivität Wärme anstaut und sorgen für die richtige Belüftung. Die...","https://www.waidlife.com/regenjacken/neoshell-reliance-jacket","https://www.waidlife.com/media/image/c8/ab/aa/NeoShellRelianceJacketLifestyle2.jpg","366.75 EUR","in stock","new",""' to a record. Reason: 'java.util.NoSuchElementException: null'
    at io.cdap.plugin.http.source.batch.HttpBatchSource.transform(HttpBatchSource.java:109) ~[1614281902851-0/:na]

I tried many combinations of configuration settings but could not figure out why the pipeline won't run successfully.

To reproduce the problem, here is the JSON export of the whole pipeline: https://pastebin.com/0qkvTSvh


Solution

  • This happens because some of the quoted fields contain additional , characters. At the moment, we do not support CSV in which a quoted field contains the delimiter. If this is just test input, I suggest trying string values that do not contain a , character. Null values are supported and should work as expected.

    I have created a bug for this.
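
To illustrate why a comma inside a quoted field breaks the record, here is a minimal Python sketch. It does not reproduce the plugin's actual code. It only assumes, for illustration, a parser that splits on every comma, and compares it with a quote-aware parser from Python's standard csv module. The line is a shortened version of the failing line from the log, and the helper names (naive_split, quoted_split, to_record) are hypothetical.

```python
import csv

# Column names taken from the header row of the export.
HEADER = ["id", "title", "description", "link", "image_link", "price",
          "availability", "condition", "google_product_category"]

# Shortened version of the failing line from the log (description trimmed).
LINE = ('"10042","NeoShell Reliance Jacket",'
        '"Im Gegensatz zu gewöhnlichem Shell-Material, ist NeoShell besonders soft",'
        '"https://www.waidlife.com/regenjacken/neoshell-reliance-jacket",'
        '"https://www.waidlife.com/media/image/c8/ab/aa/NeoShellRelianceJacketLifestyle2.jpg",'
        '"366.75 EUR","in stock","new",""')


def naive_split(line):
    """Split on every comma, ignoring quotes (assumed behaviour, for illustration)."""
    return line.split(",")


def quoted_split(line):
    """Split with a quote-aware parser, so commas inside quotes stay in the field."""
    return next(csv.reader([line]))


def to_record(fields):
    """Map fields onto the header; empty strings become None (nullable columns)."""
    if len(fields) != len(HEADER):
        raise ValueError(f"expected {len(HEADER)} fields, got {len(fields)}")
    return {name: (value or None) for name, value in zip(HEADER, fields)}


if __name__ == "__main__":
    print(len(naive_split(LINE)))    # one field too many for the schema
    print(len(quoted_split(LINE)))   # matches the number of header columns
    print(to_record(quoted_split(LINE))["google_product_category"])
```

With the naive split, the comma after "Shell-Material" produces one field more than the schema has columns, so the line cannot be mapped to a record. The quote-aware parser keeps the description intact, and the empty last column maps cleanly to a null value, which is consistent with the statement above that null values are not the cause.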