Search code examples
hadoophiveavrospark-avrohive-table

Hive External table on AVRO file producing only NULL data for all columns


I am trying to create an Hive external table on top of some avro files which are generated using spark-scala. I am using CDH 5.16 which has hive 1.1, spark 1.6.

I created hive external table, which ran successfully. But when i query the data i am getting NULL for all the columns. My problem is similar to this

Upon some research, i found out it might be the problem with schema. But i couldn't find the schema file for these avro files in the location.

I am pretty new to avro file type. Can some one please help me out here.

Below is my spark code snippet where i have saved the file as avro:

df.write.mode(SaveMode.Overwrite).format("com.databricks.spark.avro").save("hdfs:path/user/hive/warehouse/transform.db/prod_order_avro")

Below is my hive external table create statement:

create external table prod_order_avro
(ProductID string,
ProductName string,
categoryname string,
OrderDate string,
Freight string,
OrderID string,
ShipperID string,
Quantity string,
Sales string,
Discount string,
COS string,
GP string,
CategoryID string,
oh_Updated_time string,
od_Updated_time string
)
STORED AS AVRO
LOCATION '/user/hive/warehouse/transform.db/prod_order_avro';

Below is the result i am getting when i query the data: select * from prod_order_avro

Result

At the same time, when i am reading these avro files using spark-scala as dataframe and printing them, i am getting proper result. Below is the spark code i used to read these data:

val df=hiveContext.read.format("com.databricks.spark.avro").option("header","true").load("hdfs:path/user/hive/warehouse/transform.db/prod_order_avro")

avro files data when read through spark-scala

My question is,

  • While creating these avro files, do i need to change my spark
    code to create schema files separately or will it be embedded with
    the files. If needs to be separate, then how to achieve it?
  • If not how to create hive table so that schema is retrieved from the file automatically. I read that in latest version hive takes care of this issue by itself if schema is present in the files.

Kindly help me out here


Solution

  • Resolved this..it was a schema issue. The schema was not embedded with the avro files.So i had to extract schema using avro-tools and passed it while creating table. Its working now.

    I followed the below steps:

    1. Extracted few data from avro files stored in hdfs into a file in local system. Below is the command used for the same:

      sudo hdfs dfs -cat /path/file.avro | head --bytes 10K > /path/temp.txt

    2. Used avro-tools getschema command to extract schema from this data:

      avro-tools getschema /path/temp.txt

    3. Copy the resulting schema(it will be in the form of json data) into a new file with .avsc extension and upload the same into HDFS

    4. While creating the Hive External table add the below property to it:

      TBLPROPERTIES('avro.schema.url'='hdfs://path/schema.avsc')