I'm using a Drill script to convert a set of Parquet files to CSV, and I'm running into a problem where some of the fields contain commas. This breaks the output because Drill does not appear to add any quoting or escaping ("<field>" or '<field>') around field values in the converted files (e.g. for string values that look like "Soft-drink, Large").
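For illustration (made-up column values), a row whose field contains a comma gets split into an extra column when the CSV is read back, whereas a quoted field would not:

42,Soft-drink, Large,1.99       <- "Soft-drink, Large" comes back as two fields
42,"Soft-drink, Large",1.99     <- what I was expecting (quoted field)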
The script looks like this:
/opt/mapr/drill/drill-1.11.0/bin/sqlline \
-u jdbc:drill:zk=node001:5181,node002:5181,node003:5181 \
-n $(tail -n+1 $basedir/src/drill-creds.txt | head -1) \
-p $(tail -n+2 $basedir/src/drill-creds.txt | head -1) \
--run=$sqldir
where the SQL being run by Drill looks like this:
alter session set `store.format`='csv';
create table dfs.myworkspace.`/path/to/csv/destination` as
select ....
from dfs.myworkspace.`/path/to/origin/files`
Does anyone have a common way to work around this? Is there a way to add quoting/escaping characters to the converted CSV files? I checked the docs (https://drill.apache.org/docs/create-table-as-ctas/) but couldn't find anything related.
One option is to write TSV or PSV output instead, so the tab or pipe delimiter no longer collides with the commas inside your data. For example:
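A minimal sketch, reusing the CTAS from the question (the destination path is a placeholder):

alter session set `store.format`='psv';
create table dfs.myworkspace.`/path/to/psv/destination` as
select ....
from dfs.myworkspace.`/path/to/origin/files`;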
Alternatively, change the delimiter used for CSV cells by configuring the text format plugin of your file storage plugin: in the Drill Web UI choose Storage, click Update on the dfs storage plugin, find the csv entry under formats, and add a delimiter property with the value you want, e.g.:
"csv": {
"type": "text",
"extensions": [
"csv2"
],
"skipFirstLine": false,
"extractHeader": true,
"delimiter": "^"
},
Also take a look at the quote and escape properties of the text format plugin, which are documented on the same page.
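A sketch of what that might look like in the same csv entry (the quote and escape values here are illustrative, not verified defaults):

"csv": {
  "type": "text",
  "extensions": [
    "csv2"
  ],
  "extractHeader": true,
  "delimiter": ",",
  "quote": "\"",
  "escape": "\""
},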
See more: https://drill.apache.org/docs/text-files-csv-tsv-psv/#use-a-distributed-file-system