Tags: csv, hadoop, hive, hiveql, opencsv

Using CSV Serde with Hive create table converts all field types to string


If I create a table and specify a CSVSerde, all fields are converted to string type.

    hive> create table foo(a int, b double, c string) row format serde 'com.bizo.hive.serde.csv.CSVSerde' stored as textfile;
    OK
    Time taken: 0.22 seconds
    hive> describe foo;
    OK
    a    string    from deserializer
    b    string    from deserializer
    c    string    from deserializer
    Time taken: 0.063 seconds, Fetched: 3 row(s)

That Serde is from https://github.com/ogrodnek/csv-serde

If I try the serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' from this page, https://cwiki.apache.org/confluence/display/Hive/CSV+Serde, I see the same thing: all fields are changed to type string.

Hive version 1.2.1, Hadoop version 2.7.0, java version "1.7.0_80"


Solution

  • Yes, com.bizo.hive.serde.csv.CSVSerde only creates strings. This is how it was built and how it will always work; there is no option to change it. I think it is likely that strings will work for the majority of your variables. That being said, for the few columns that need another type, I would do the following.

    Use a SELECT statement with a regex-based column specification. This syntax works in Hive releases prior to 0.13.0, and in 0.13.0 and later releases if the configuration property hive.support.quoted.identifiers is set to none. It lets you quickly build a new table, altering the types of only the few variables you need as doubles or ints. In the query below, `(a|b|c)?+.+` matches every column except a, b, and c, which are then added back with casts.

    -- Treat backquoted names as regular expressions (needed in Hive 0.13.0 and later).
    set hive.support.quoted.identifiers=none;

    drop table if exists database.table_name;
    create table if not exists database.table_name as
    select `(a|b|c)?+.+`              -- every column except a, b, and c
         , cast(a as double) as a
         , cast(b as double) as b
         , cast(c as double) as c
    from database.some_table;

    You can use this method to touch only the variables that need to change while minimizing query length. You can also create a view on top of your table to query it this way, or create an external table and drop the old table.
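
    As a rough illustration of the view option, here is a minimal sketch against the foo table from the question. The view name foo_typed is hypothetical, and the casts assume the intended types are the ones in the original create table statement (a int, b double, c string).

    ```sql
    -- foo is the CSVSerde-backed table from the question, where every column is a string.
    -- foo_typed is a hypothetical view name chosen for this example.
    create view if not exists foo_typed as
    select cast(a as int)    as a
         , cast(b as double) as b
         , c                          -- already a string, so no cast is needed
    from foo;

    -- Check the column types the view exposes.
    describe foo_typed;
    ```

    Queries can then read from foo_typed instead of foo. Keep in mind that a Hive cast returns NULL for values it cannot parse, so malformed numbers in the CSV would show up as NULL rather than raising an error.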