I am trying to create a Hive external table on top of some Avro files which are generated using Spark (Scala). I am using CDH 5.16, which has Hive 1.1 and Spark 1.6.
I created the Hive external table, and the create statement ran successfully. But when I query the data I am getting NULL for all the columns.
My problem is similar to this question. Upon some research, I found out it might be a problem with the schema, but I couldn't find a schema file for these Avro files in the location.
I am pretty new to the Avro file type. Can someone please help me out here?
Below is my Spark code snippet where I have saved the files as Avro:
df.write.mode(SaveMode.Overwrite).format("com.databricks.spark.avro").save("hdfs:path/user/hive/warehouse/transform.db/prod_order_avro")
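Saving in this format needs the databricks spark-avro package on the classpath; with Spark 1.6 (Scala 2.10) it is typically pulled in with something like the line below (the exact artifact version here is an assumption, pick the spark-avro release matching your Spark/Scala versions):
spark-shell --packages com.databricks:spark-avro_2.10:2.0.1  # version is an assumption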
Below is my Hive external table create statement:
create external table prod_order_avro
(ProductID string,
ProductName string,
categoryname string,
OrderDate string,
Freight string,
OrderID string,
ShipperID string,
Quantity string,
Sales string,
Discount string,
COS string,
GP string,
CategoryID string,
oh_Updated_time string,
od_Updated_time string
)
STORED AS AVRO
LOCATION '/user/hive/warehouse/transform.db/prod_order_avro';
Below is the query I am running; the result is NULL for every column:
select * from prod_order_avro
At the same time, when I read these Avro files as a DataFrame using Spark (Scala) and print them, I am getting the proper result. Below is the Spark code I used to read the data:
val df=hiveContext.read.format("com.databricks.spark.avro").option("header","true").load("hdfs:path/user/hive/warehouse/transform.db/prod_order_avro")
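For reference, a quick way to compare what Spark infers against the Hive column list, using the df loaded above:
df.printSchema()  // shows the schema Spark infers from the Avro files
df.show(5)        // prints a few rows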
My question is: while saving these Avro files, do I need to change my Spark code, or do I need to change the Hive table, so that the schema is retrieved from the Avro file automatically? I read that in the latest versions Hive takes care of this issue by itself if the schema is present in the files. Kindly help me out here.
Resolved this.. it was a schema issue. The schema was not embedded in the Avro files, so I had to extract the schema using avro-tools and pass it while creating the table. It's working now.
I followed the below steps:
Extracted a small sample of the data from the Avro files stored in HDFS into a file on the local system (the schema sits in the Avro file header, so the first few kilobytes are enough). Below is the command used for the same:
sudo hdfs dfs -cat /path/file.avro | head --bytes 10K > /path/temp.txt
Used the avro-tools getschema command to extract the schema from this sample:
avro-tools getschema /path/temp.txt
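If avro-tools is not at hand, the same embedded schema can be read programmatically; a minimal Scala sketch, assuming the Avro library is on the classpath and the sample file from the previous step:
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

// "/path/temp.txt" is the local sample extracted above (placeholder path);
// only the file header is read, which is where Avro stores the schema
val reader = new DataFileReader(new java.io.File("/path/temp.txt"), new GenericDatumReader[GenericRecord]())
println(reader.getSchema.toString(true))  // pretty-printed JSON schema
reader.close()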
Copied the resulting schema (it will be in the form of JSON) into a new file with a .avsc extension and uploaded the same into HDFS.
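The upload itself is a plain HDFS put; both paths below are placeholders:
hdfs dfs -put /path/schema.avsc hdfs://path/schema.avsc  # placeholder paths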
While creating the Hive external table, added the below property to it:
TBLPROPERTIES('avro.schema.url'='hdfs://path/schema.avsc')
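Putting it together, the final create statement looks roughly like the sketch below; with avro.schema.url set, the column list can even be dropped, since Hive derives the columns from the Avro schema:
create external table prod_order_avro
STORED AS AVRO
LOCATION '/user/hive/warehouse/transform.db/prod_order_avro'
-- schema path is wherever the .avsc was uploaded (placeholder here)
TBLPROPERTIES('avro.schema.url'='hdfs://path/schema.avsc');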