Search code examples
csvapache-drill

Apache drill cannot parse CSV files with windows EOL correctly?


Ok, let's save someone 8 hours of clueless debugging.

TL;DR: Apache drill cannot correctly parse CSV files generated on windows machines. That's because their EOL is set to \r\n by default unlike to unix system, where it is set to \n. And this leads to horribly undebuggable errors because the leading \r probably stays clued to the last field's value. And what's funny, you won't notice this because it's invisible.

Let's have two files, one created in linux and the second in windows: hello.linux.csv and hello.win.csv. The content is the same (at least it looks like it is ...)

field_a,field_b
Hello,0.5

Let's have a query.

SELECT * from (...)/hello.linux.csv;
---
field_a, field_b
Hello, "0.5"

SELECT * from (...)/hello.win.csv;
---
field_a, field_b
Hello, "0.5"

Fine! Let's do something with the data. Cast "0.5" to number should be fine (and necessary).

SELECT 
   field_a, CAST (field_b as DECIMAL(10, 2)) as test 
from (...)/hello.linux.csv;
---
field_a, test
Hello, 0.5


-- ... aaand, here we go!
SELECT 
   field_a, CAST (field_b as DECIMAL(10, 2)) as test 
from (...)/hello.win.csv;

[30038]Query execution error. Details:[
SYSTEM ERROR: NumberFormatException
Fragment 0:0
Please, refer to logs for more information.  -- In the logs, there is only useless java stacktrace, of course.
[Error Id: 3551c939-3f5b-42c1-9b58-d600da5f12a0 on drill-develop-7bdb45c597-52rnz:31010]
]
...

(And now, imagine how much time would take to reveal this on a complex production setup where the queries, data and other factors are somehow more complicated.)

The question: Is there a way how to force apache drill (v 1.15) to process CSV files created with windows EOLs?


Solution

  • You can update csv format line delimiter to \r\n but this would apply to all csv files in the scope of your text plugin. To change delimiter per table use table function.

    https://drill.apache.org/docs/plugin-configuration-basics/