Search code examples
hadoophiveopencsvhive-serdehiveddl

Multiple escape characters for hive create table


I am trying to load a csv with pipe delimiter to an hive external table. The data values contain single quote, double quotes, brackets etc.. Using Open CSV version 2.3

testfile.csv

id|name|phone
1|Rahul|123
2|Kumar's|456
3|Neetu"s|789
4|Ravi [Roma]|234

Table created as -

drop table test_schema.hive_test;
CREATE EXTERNAL TABLE test_schema.hive_test (id string, name string, phone string) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES
(
'separatorChar' = '|',
'quoteChar' = '[\'\"]',
'escapeChar' = '\\'
)
LOCATION '/staging/test/hive'
tblproperties ("skip.header.line.count"="1");

Output -

+-------------+---------------+----------------+
| hive_test.id|hive_test.name |hive_test.phone |
+-------------+---------------+----------------+
| 1           | Rahul         |123             |
| 2           | Kumar's       |456             |
| 3           | Neetu"s       |789             |
| 4           | NULL          |234             |
+---------------+------------------------------+

Expected Output -

+-------------+---------------+----------------+
| hive_test.id|hive_test.name |hive_test.phone |
+-------------+---------------+----------------+
| 1           | Rahul         |123             |
| 2           | Kumar's       |456             |
| 3           | Neetu"s       |789             |
| 4           | Ravi [Roma]   |234             |
+---------------+------------------------------+

The problem is we have multiple escape sequence/quote character in data and hence we need to include all of them.


Solution

  • Use LazySimpleSerDe (STORED AS TEXTFILE):

    CREATE EXTERNAL TABLE test_schema.hive_test (id string, name string, phone string) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
    ESCAPED BY '\\' 
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE
    LOCATION '/staging/test/hive_test'
    tblproperties ("skip.header.line.count"="1");