I have this (shortened) Avro schema:
{
  "type": "record",
  "name": "license_upsert",
  "namespace": "data.model",
  "fields": [
    {
      "name": "upsert",
      "type": {
        "name": "EventType",
        "type": "enum",
        "symbols": ["INSERT", "UPDATE"]
      }
    }
  ]
}
It just defines an enum. I can easily create an Avro file from some JSON data:
{
  "upsert": "INSERT"
}
Using avro-tools, it all works fine, both to and from Avro.
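For reference, the round trip with avro-tools looks roughly like this (the jar path and file names are mine, not from the original setup):

# JSON -> Avro, encoded against the schema above
java -jar avro-tools.jar fromjson --schema-file event.avsc event.json > event.avro
# Avro -> JSON, to verify the round trip
java -jar avro-tools.jar tojson event.avro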
Now, these Avro files are loaded into an external table in Hive, and boom, Hive tells me:
java.io.IOException: org.apache.avro.AvroTypeException: Found string, expecting data.model.EventType
According to the docs, Hive does not actually support enums, but if I DESCRIBE
the table, the field is shown as a string:
col_name | data_type | comment
---------+-----------+--------
upsert   | string    | ""
Is there a way to tell Hive that it should use a string? Even if I run a query that does not select the upsert field, I get the same error.
I create the table as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS events.event
PARTITIONED BY (year INT, month INT, day INT, hour INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
'avro.schema.url'='file:///path/to/event.avsc'
)
STORED AS AVRO
LOCATION '/events/event'
;
If I generate data with the avro-tools random command, the data loads
perfectly into Hive.
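That is, something along these lines (the count and output file name are placeholders of mine):

java -jar avro-tools.jar random --schema-file event.avsc --count 10 random.avro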
The data I am actually using is created by Confluent.
The reason is hinted at in the last line of the question:

The data I am actually using is created by Confluent.
It turns out that on output, the HDFS sink connector converts ENUMs to Strings. Since I created the external tables in Hive based on my original schema, there was a discrepancy. If I instead extract the schema from a file created by the HDFS sink and use that one in the table definition, everything works as expected.
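A sketch of the fix, assuming avro-tools is available (the part file name is a placeholder for whatever the sink actually wrote):

# dump the writer schema embedded in a file produced by the HDFS sink
java -jar avro-tools.jar getschema part-00000.avro > sink_event.avsc

In the extracted schema the field is a plain string rather than an enum, roughly:

{ "name": "upsert", "type": "string" }

Pointing 'avro.schema.url' at this extracted schema instead of the original one makes the table readable.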