Search code examples
amazon-web-servicescsvhive-serde

Unable to get AWS Athena escapeChar working


I am trying to take AWS Athena for a spin and I am running in to issues with the a csv file I am trying to test against.

Using the following the escapeChar doesn't appear to be working. I have tried using the crawler and specifying the escapeChar in the UI, with and without doublequotes but still no luck. When a row has the delimiter in a string, it is read as a field delimiter even though it is escaped.

DDL

CREATE EXTERNAL TABLE mytestcsvtable (
    col_id          string,
    col_description string,
    col_text        string,
    col_decimal     string,
    col_float       string,
    col_date        string,
    col_time        string,
    col_timestamp   string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = '|',
'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://mystaging2/tmp/';

Data

401|SingleBackslash|Goblin \\Gobby Gruesome|.00|0E0|2020-03-02|21.21.43|2020-03-02 12.35.52.145894
402|EvenNumberOfSingleBackslash|Goblin \\Gobby\\ Gruesome|.00|0E0|2020-03-02|22.22.44|2020-03-02 12.35.52.156563
403|SingleSingleBackslash|Goblin \\\\Gobby\\\\ Gruesome|.00|0E0|2020-03-02|23.23.45|2020-03-02 12.35.52.158011
404|OddNumberOfSingleBackslash1|Goblin \\Gruesome\\ \\Gruesome|.00|0E0|2020-03-02|00.24.46|2020-03-02 12.35.52.159835
405|OddNumberOfSingleBackslash2|Goblin \\\\Gobby\\ Gruesome|.00|0E0|2020-03-02|01.25.47|2020-03-02 12.35.52.162538
406|OddNumberOfSingleSingleBackslash1|Goblin \\\\Gruesome\\\\ \\\\Gruesome|.00|0E0|2020-03-02|02.26.48|2020-03-02 12.35.52.163510
407|OddNumberOfSingleSingleBackslash2|Goblin \\\\Gobby\\\\\\ \\\\Gruesome|.00|0E0|2020-03-02|03.27.49|2020-03-02 12.35.53.167322
408|SingleSingleSingleBackslash1|Goblin \\\\\\Gobby|.00|0E0|2020-03-02|04.28.50|2020-03-02 12.35.53.179868
501|SinglePipe|Goblin \|Gobby Gruesome|.00|0E0|2020-03-02|05.29.51|2020-03-02 12.35.53.180025
502|EvenNumberOfSinglePipe|Goblin \|Gobby\| Gruesome|.00|0E0|2020-03-02|06.30.52|2020-03-02 12.35.53.184042
503|SingleSinglePipe|Goblin \|\|Gobby\|\| Gruesome|.00|0E0|2020-03-02|07.31.53|2020-03-02 12.35.53.189979
504|OddNumberOfSinglePipe1|Goblin \|Gruesome\| \|Gruesome|.00|0E0|2020-03-02|08.32.54|2020-03-02 12.35.53.194734
505|OddNumberOfSinglePipe2|Goblin \|\|Gobby\| Gruesome|.00|0E0|2020-03-02|09.33.55|2020-03-02 12.35.53.196996
506|OddNumberOfSingleSinglePipe1|Goblin \|\|Gruesome\|\| \|\|Gruesome|.00|0E0|2020-03-02|10.34.56|2020-03-02 12.35.53.203568
507|OddNumberOfSingleSinglePipe2|Goblin \|\|Gobby\|\|\| \|\|Gruesome|.00|0E0|2020-03-02|11.35.57|2020-03-02 12.35.53.203999
508|SingleSingleSinglePipe1|Goblin \|\|\|Gobby|.00|0E0|2020-03-02|12.36.58|2020-03-02 12.35.54.208965

enter image description here

Thanks for taking the time to read my post!


Solution

  • In csv the escapeChar is used to escape a possible quoteChar inside a quoted field, not to escape the delimiter.

    To "escape" a delimiter inside a field, the field has to be quoted - for example:

    501|SinglePipe|Goblin \|Gobby Gruesome|.00|0E0|2020-03-02
    

    becomes:

    501|SinglePipe|"Goblin \|Gobby Gruesome"|.00|0E0|2020-03-02
    or
    501|SinglePipe|"Goblin |Gobby Gruesome"|.00|0E0|2020-03-02
    

    If there is a quote inside the field, this has to be escaped:

    501|SinglePipe|"Goblin|24\" monitor|Gobby Gruesome"|.00|0E0|2020-03-02
    

    Also make sure to tell the parser to use a quoteChar if this is not the default.