Tags: hadoop, hive, hiveql, sandbox, hue

Loading CSV data into Hive ORC tables


I have data as below:

id,name,description,category
1,sid,I am don,right?,production  

I followed the steps provided in this link. The main problem is that when I insert the data into the temp table, the data in the 3rd column gets split and pushed into the 4th column, i.e. when Hive sees the "," before the word "right" it cuts the field there and shifts the rest into the next column. This happens because the data arrives in CSV format, so I set "," as the field delimiter while creating the temp table. So there is a mess. How can I resolve it?

After following the steps suggested by xenodevil, I ran the below query and I am getting an error:

insert into perTable select * from sampleTable; 

Where sampleTable is the temp table and perTable is the ORC table and sampleTable data looks as below:

+-----------------+-------------------+--------------------------+-----------------------+--+
| sampletable.id  | sampletable.name  | sampletable.description  | sampletable.category  |
+-----------------+-------------------+--------------------------+-----------------------+--+
| 1               | sid               | I am don,right?          | production            |
+-----------------+-------------------+--------------------------+-----------------------+--+

But I am getting the below error:

ERROR : Status: Failed
ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1560140822404_0022_1_00, diagnostics=[Task failed, taskId=task_1560140822404_0022_1_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"id":"1","name":"sid","description":"I am don,right?","category":"production"}
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
        at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139) 

So what is the issue here?


Solution

  • How are you going to determine how many commas can appear in the 3rd column? If it can contain arbitrary text, then it can contain any number of commas. This must be controlled at the file generation level, i.e. when your data is being written, rather than when you try to read it back later in CSV format. While generating the CSV file, you may

    • Put an escape character before each comma that is part of the data, rather than before a comma acting as the delimiter
    • Enclose each column's text in double quotes
    • Enclose in double quotes only the columns in which the delimiting character (comma) is present

    These are some of the common practices which reliably resolve such issues.
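    The quoting approach can be sketched with Python's standard `csv` module (an illustration of the general practice, not part of your existing pipeline): with `QUOTE_MINIMAL`, only fields that contain the delimiter get quoted, and a CSV-aware reader then recovers all four columns intact.

```python
import csv
import io

# The problematic row: the description field contains a comma.
row = ["1", "sid", "I am don,right?", "production"]

# QUOTE_MINIMAL quotes only fields containing the delimiter
# (the third bullet above).
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerow(row)
line = buf.getvalue().strip()
print(line)  # 1,sid,"I am don,right?",production

# Reading it back with a CSV-aware parser yields 4 fields, not 5.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)  # ['1', 'sid', 'I am don,right?', 'production']
```

    On the Hive side, a quoted file like this can be read with `org.apache.hadoop.hive.serde2.OpenCSVSerde` instead of a plain comma `FIELDS TERMINATED BY` delimiter.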

    An unreliable way, very specific to the data you have presented, and which will resolve only the stray commas in your 3rd column, is to use the following pattern with RegexSerDe for your ROW FORMAT:

    ([0-9]*),([a-zA-Z ]*),([A-Za-z ,?]*),([A-Za-z ,]*)
    

    To set this, you will need to modify the table's DDL to something like:

    CREATE TABLE `your_database_name.your_table_name`(
        `id` string,
        `name` string,
        `description` string,
        `category` string
    )
    COMMENT 'Your Comment'
    ROW FORMAT SERDE
      'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      'input.regex'='([0-9]*),([a-zA-Z ]*),([A-Za-z ,?]*),([A-Za-z ,]*)'
     )
    STORED AS TEXTFILE
    ;
    

    I have checked the Regex here, but you will need to adjust the DDL syntax to suit your needs.
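
    You can also sanity-check the pattern outside Hive with a quick Python sketch (RegexSerDe uses Java regex, but this pattern behaves the same in both engines for this input). The greedy third group initially swallows everything to the end of the line, then backtracks to the last comma, so `production` still lands in the fourth column:

```python
import re

pattern = r'([0-9]*),([a-zA-Z ]*),([A-Za-z ,?]*),([A-Za-z ,]*)'
line = "1,sid,I am don,right?,production"

# fullmatch requires the whole line to fit the four-group pattern.
m = re.fullmatch(pattern, line)
print(m.groups())  # ('1', 'sid', 'I am don,right?', 'production')
```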