Speaking in terms of data-related errors, drill does not give any clue where and how to find the issue. SQL syntax and logic errors are rendered somewhat comprehensible, but imagine these cases:
You have ~10GB CSV full of numbers (sales.csv):
ArticleId,CategoryId,Price,SupplierPrice,VAT
1234,23,15.19,12.45,0
1235,23,16.19,13.45,0.15
...
[83541670] lines
...
475,34.0,55.0,50,0.15 # This random error cause (CategoryId should be INT for this example)
...
[34767806] lines
...
[EOF]
Then consider a similar query:
SELECT
CAST (ArticleId as INT) as ArticleId,
CAST (CategoryId as INT) as CategoryId,
CAST (Price as DOUBLE) as Price,
CAST (SupplierPrice as double) as SupplierPrice,
CAST (VAT as DOUBLE) as VAT
from (...)/sales.csv
And you got this error:
SYSTEM ERROR: NumberFormatException:
Fragment 0:0
Please, refer to logs for more information.
Well, referring to logs:
[Error Id: 682cc450-61fb-4307-809a-fcb794fec692 on drill-staging-745f9968d4-m5pv7:31010]
at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:630) ~[drill-common-1.16.0.jar:1.16.0]
at org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:363) [drill-java-exec-1.16.0.jar:1.16.0]
at org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:219) [drill-java-exec-1.16.0.jar:1.16.0]
at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:329) [drill-java-exec-1.16.0.jar:1.16.0]
at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) [drill-common-1.16.0.jar:1.16.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_111-internal]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_111-internal]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_111-internal]
Caused by: java.lang.NumberFormatException:
at org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeI(StringFunctionHelpers.java:96) ~[drill-java-exec-1.16.0.jar:1.16.0]
[...continues with hundred similar lines ...]
When it would be so beautiful when Drill would give us errors like ...:
ERROR casting column "Price" to Int - Invalid integer value "34.0"
Cast operation is applied after data was read thus at this point Drill does not have information from where in file incorrect data came from. Since Drill 1.16 schema provisioning support was introduced. Table schema is applied during data read, so Drill can produce better error messages (DATA_READ_ERROR
). For more information see https://drill.apache.org/docs/create-or-replace-schema/.