Search code examples
hadooploadhivecreate-table

Collecting data from file using Hive


I Have my data in this format.

"123";"mybook1";"2002";"publisher1";
"456";"mybook2;the best seller";"2004";"publisher2";
"789";"mybook3";"2002";"publisher1";

the fields are enclosed in "" and are delimited by ; Also the book name may contain ';' in between.

Can you tell me how to load this data from file to hive table

the below query which i am using now obviously not working ;

create table books (isbn string,title string,year string,publisher string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'

if possible i want the userid and year fields to be stored as Int. Please help

Thanks, Harish


Solution

  • The thing you are missing is RegexSerDe. It's very helpful in inserting only a part of text from the input. Your DDL goes like :

    create table books ( isbn string, title string, year string, publisher string ) 
      ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
      WITH SERDEPROPERTIES  (
         "input.regex" = "(?:\")(\\d*)(?:\"\;\")([^\"]*)(?:\"\;\")(\\d*)(?:\"\;\")([^\"]*)\"(?:\;)" ,
         "output.format.string" = "%1$s %2$s %3$s %4$s"
        )
      STORED AS TEXTFILE;
    

    The regex may look complex at the first sight due the escaping and non-capturing groups. Actually it contains 2 groups (\d*) & ([^"]*) placed alternately two times. The non-capturing groups ((?:) just helps to remove the unnecessary context. The group ([^"]*) also take care of ';' inside bookName field.

    But nothing comes without a cost. Despite all of its features, RegexSerDe supports only string fields. All you can do is to call the default hive UDF cast to do the transformation when selecting the data from the table. eg(actual syntax may vary a bit) :

     SELECT cast( year as int ) from books;
    

    Hope this helps.