Tags: scala, csv, apache-spark, apache-commons, spark-csv

Spark fails to read CSV when last column name contains spaces


I have a CSV that looks like this:

+-----------------+-----------------+-----------------+
| Column One      | Column Two      | Column Three    |
+-----------------+-----------------+-----------------+
| This is a value | This is a value | This is a value |
+-----------------+-----------------+-----------------+
| This is a value | This is a value | This is a value |
+-----------------+-----------------+-----------------+
| This is a value | This is a value | This is a value |
+-----------------+-----------------+-----------------+

In plain text, it actually looks like this:

Column One,Column Two,Column Three
This is a value,This is a value,This is a value
This is a value,This is a value,This is a value
This is a value,This is a value,This is a value

My spark.read method looks like this:

val df = spark.read
    .format("csv")
    .schema(schema)
    .option("quote", "\"")
    .option("escape", "\"")
    .option("header", "true")
    .option("multiLine", "true")
    .option("mode", "DROPMALFORMED")
    .load(inputFilePath)

When multiLine is set to true, the df loads as empty. It loads fine when multiLine is set to false, but I need multiLine set to true.

If I change the name of Column Three to ColumnThree, and also update that in the schema object, then it works fine. It seems like multiLine is being applied to the header row! I was hoping that wouldn't be the case when header is also set to true.

Any ideas how to get around this? Should I be using the univocity parser instead of the default commons?
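
For reference, switching parsers is only configurable on the old external com.databricks:spark-csv package, which exposed a parserLib option. A sketch of what that would look like, assuming you're on that package rather than Spark's built-in reader (Spark 2.x's built-in csv reader already uses univocity internally and ignores this option):

val df = spark.read
  .format("com.databricks.spark.csv")
  .option("parserLib", "univocity")  // the package's default is the commons parser
  .option("header", "true")
  .schema(schema)
  .load(inputFilePath)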

UPDATE:

I don't know why that mocked data was working fine. Here's a closer representation of the data:

CSV (just one header row and one line of data):

Digital ISBN,Print ISBN,Title,Price,File Name,Description,Book Cover File Name
97803453308,test,This is English,29.99,qwe_1.txt,test,test

Schema & the spark.read method:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val df = spark.read
  .format("csv")
  .schema(StructType(Array(
    StructField("Digital ISBN", StringType, true),
    StructField("Print ISBN", StringType, true),
    StructField("Title", StringType, true),
    StructField("File Name", StringType, true),
    StructField("Price", StringType, true),
    StructField("Description", StringType, true),
    StructField("Book Cover File Name", StringType, true)
  )))
  .option("quote", "\"")
  .option("escape", "\"")
  .option("header", "true")
  .option("multiLine", "true")
  .option("mode", "DROPMALFORMED")
  .load(inputFilePath)

The df.show() output in spark-shell:

+------------+----------+-----+---------+-----+-----------+--------------------+
|Digital ISBN|Print ISBN|Title|File Name|Price|Description|Book Cover File Name|
+------------+----------+-----+---------+-----+-----------+--------------------+
+------------+----------+-----+---------+-----+-----------+--------------------+
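
As a side note for anyone debugging similar disappearing rows: one way to see why DROPMALFORMED discards records is to switch to PERMISSIVE mode and capture the raw text of each malformed row in an extra column. A sketch of that, reusing the schema above (the _corrupt_record column name is arbitrary):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Same schema as above, plus a column to hold the raw text of malformed rows
val debugSchema = StructType(Array(
  StructField("Digital ISBN", StringType, true),
  StructField("Print ISBN", StringType, true),
  StructField("Title", StringType, true),
  StructField("File Name", StringType, true),
  StructField("Price", StringType, true),
  StructField("Description", StringType, true),
  StructField("Book Cover File Name", StringType, true),
  StructField("_corrupt_record", StringType, true)
))

val debugDf = spark.read
  .format("csv")
  .schema(debugSchema)
  .option("header", "true")
  .option("multiLine", "true")
  .option("mode", "PERMISSIVE")  // keep malformed rows instead of dropping them
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .load(inputFilePath)
  .cache()  // Spark disallows queries that reference only the corrupt-record column without caching

debugDf.filter(col("_corrupt_record").isNotNull).show(false)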

UPDATE 2:

I think I found what's different. When I copy the data from the CSV and save it to another CSV, it works fine. But the original CSV (which was saved by Excel) fails... The CSV saved by Excel is 1290 bytes, while the CSV I created myself (which works fine) is 1292 bytes.

UPDATE 3:

I opened the two files mentioned in Update 2 in vim and noticed that the CSV saved by Excel had ^M instead of newlines. All of my testing prior to this was flawed because I was always comparing a CSV originally saved by Excel against a CSV created in Sublime, and Sublime wasn't showing the difference. I'm sure there's a setting or package I can install to see that, since I use Sublime as my go-to editor for one-off files...
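
For anyone hitting the same wall: line terminators can also be checked without trusting an editor by counting the raw bytes. A quick Scala sketch (the path is a placeholder):

import java.nio.file.{Files, Paths}

// CR-only output suggests an Excel-for-Mac style save, CR+LF pairs a Windows
// save; LF-only is what Spark running on Linux expects.
val bytes = Files.readAllBytes(Paths.get("original.csv"))  // placeholder path
val crCount = bytes.count(_ == '\r'.toByte)
val lfCount = bytes.count(_ == '\n'.toByte)
println(s"CR bytes: $crCount, LF bytes: $lfCount")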

Not sure if I should close this question since the title is misleading. Then again, there's gotta be some value to someone out there lol...


Solution

  • Since the question has a few up-votes, here's the resolution to the original problem as an answer...

    Newlines in files saved in the Windows world contain both a carriage return and a line feed. Spark (running on Linux) sees this as a malformed row and drops it, because in its world newlines are just line feeds. A workaround sketch follows the lessons below.

    Lessons:

    • It's important to be familiar with the origin of the file you're working with.
    • When debugging data processing issues, work with an editor that shows carriage returns.
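
    As a workaround sketch (not part of the original answer's code): normalize the line endings before Spark reads the file. This assumes the file is small enough to read into memory; inputFilePath and normalizedFilePath are placeholders. Newer Spark releases also expose a lineSep option on the CSV reader, which may let such files be parsed directly, though I haven't verified it across versions.

    import java.nio.charset.StandardCharsets
    import java.nio.file.{Files, Paths}

    // Rewrite CRLF (Windows) and bare CR (old Mac / some Excel saves) as LF
    // before handing the file to Spark. Order matters: replace CRLF first.
    val raw = new String(Files.readAllBytes(Paths.get(inputFilePath)), StandardCharsets.UTF_8)
    val normalized = raw.replace("\r\n", "\n").replace("\r", "\n")
    Files.write(Paths.get(normalizedFilePath), normalized.getBytes(StandardCharsets.UTF_8))

    val df = spark.read
      .format("csv")
      .option("header", "true")
      .option("multiLine", "true")
      .load(normalizedFilePath)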