Search code examples
pysparkhiveparquetazure-databricks

Not able to create Hive table with TIMESTAMP datatype in Azure Databricks


org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.UnsupportedOperationException: Parquet does not support timestamp. See HIVE-6384;

Getting above error while executing following code in Azure Databricks.

spark_session.sql("""
                    CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.processing_table
                    (
                      campaign STRING,
                      status STRING,
                      file_name STRING,
                      arrival_time TIMESTAMP
                    )
                    PARTITIONED BY ( 
                      Date DATE)
                    ROW FORMAT SERDE
                      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
                    STORED AS INPUTFORMAT
                      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
                    OUTPUTFORMAT
                      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
                    LOCATION "/mnt/data_analysis/pre-processed/"
             """)

Solution

  • As per Hive-6384 Jira, Starting from Hive-1.2 you can use Timestamp,date types in parquet tables.

    Workarounds for Hive < 1.2 version:

    1. Using String type:

    CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.processing_table
                        (
                          campaign STRING,
                          status STRING,
                          file_name STRING,
                          arrival_time STRING
                        )
                        PARTITIONED BY ( 
                          Date STRING)
                       Stored as parquet
                       Location '/mnt/data_analysis/pre-processed/';
    

    Then while processing you can cast arrival_time,Date cast to timestamp,date types.

    Using a view and cast the columns but views are slow.

    2. Using ORC format:

    CREATE EXTERNAL TABLE IF NOT EXISTS dev_db.processing_table
                        (
                          campaign STRING,
                          status STRING,
                          file_name STRING,
                          arrival_time Timestamp
                        )
                        PARTITIONED BY ( 
                          Date date)
                       Stored as orc
                       Location '/mnt/data_analysis/pre-processed/';
    

    ORC supports both timestamp,date type