apache-drill

How to deal with unescaped commas in string fields when using `store.format`='csv' in a Drill table


When using a Drill script to convert a set of Parquet files to CSV, I'm running into a problem where some of the fields have commas in them. This is an issue because Drill does not appear to automatically add any quoting/escaping characters ("<field>" or '<field>') around the fields in the converted files (e.g. for string values that look like "Soft-drink, Large").

The script looks like

/opt/mapr/drill/drill-1.11.0/bin/sqlline \
        -u jdbc:drill:zk=node001:5181,node002:5181,node003:5181 \
        -n $(tail -n+1 $basedir/src/drill-creds.txt | head -1) \
        -p $(tail -n+2 $basedir/src/drill-creds.txt | head -1) \
        --run=$sqldir

where the SQL being run by Drill looks like

alter session set `store.format`='csv';
create table dfs.myworkspace.`/path/to/csv/destination` as 
select .... 
from dfs.myworkspace.`/path/to/origin/files`

Does anyone have a common way of working around this? Is there a way to add quoting/escaping characters to the converted CSV files? (I tried checking the docs (https://drill.apache.org/docs/create-table-as-ctas/), but could find nothing related.)


Solution

  • You can use the TSV or PSV formats, or update the delimiter symbol for CSV cells by configuring the Format Plugin of your File Storage Plugin:

    In the Drill Web UI choose Storage, then Update the dfs storage plugin, find csv under formats, and add a delimiter property with the desired value.

    "csv": {
      "type": "text",
      "extensions": [
        "csv2"
      ],
      "skipFirstLine": false,
      "extractHeader": true,
      "delimiter": "^"
    },
    

    Also, please check the escape property; see the sketch after the link below.

    See more: https://drill.apache.org/docs/text-files-csv-tsv-psv/#use-a-distributed-file-system
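
    For reference, here is a sketch of what a csv format entry might look like with the quote and escape attributes set explicitly. These attributes are listed on the docs page linked above; the values shown are the documented defaults, and whether the CTAS writer applies them when writing output can depend on your Drill version, so treat this as an illustration rather than a guaranteed fix:

    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "skipFirstLine": false,
      "extractHeader": true,
      "delimiter": ",",
      "quote": "\"",
      "escape": "\""
    },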