
Avro with Enum in Hive


I have this (shortened) Avro schema:

{
  "type": "record",
  "name": "license_upsert",
  "namespace": "data.model",
  "fields": [
    { "name": "upsert", "type":
      {
        "name":    "EventType",
        "type":    "enum",
        "symbols": ["INSERT", "UPDATE"]
      }
    }
  ]
}

This just defines an enum.

I can easily create an Avro file from some JSON data:

{
  "upsert": "INSERT"
}

Using avro-tools, it all works fine, both to and from Avro.
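For reference, the round trip with avro-tools looks roughly like this (the jar version and file names are illustrative):

java -jar avro-tools-1.11.3.jar fromjson --schema-file license_upsert.avsc record.json > record.avro   # JSON -> Avro
java -jar avro-tools-1.11.3.jar tojson record.avro   # Avro -> JSON, prints {"upsert":"INSERT"}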

Now, these Avro files are loaded into an external table in Hive, and boom, Hive tells me:

java.io.IOException: org.apache.avro.AvroTypeException: Found string, expecting data.model.EventType

According to the docs, Hive does not actually support enums, but if I DESCRIBE the table, the field is shown as a string:

col_name | data_type | comment
-------------------------------
upsert   |   string  |    ""

Is there a way for me to tell Hive that it should use a string? Even if I run a query that does not select the upsert field, I get the same error.

Note 1:

I create the table as follows:

CREATE EXTERNAL TABLE IF NOT EXISTS events.event
PARTITIONED BY (year INT, month INT, day INT, hour INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.url'='file:///path/to/event.avsc'
)
STORED AS AVRO
LOCATION '/events/event'
;
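Since the table is partitioned, the partitions also have to be registered before Hive will read any files; a minimal sketch (the partition values and path are illustrative):

ALTER TABLE events.event
ADD IF NOT EXISTS PARTITION (year=2019, month=1, day=1, hour=0)
LOCATION '/events/event/2019/01/01/00';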

Note 2:

If I generate data with avro-tools (the random command), the data loads into Hive perfectly.
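For reference, that generation step looks roughly like this (the jar version, count, and flags may vary between Avro releases):

java -jar avro-tools-1.11.3.jar random --schema-file event.avsc --count 10 random.avro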

The data I am actually using is created by Confluent.


Solution

  • The reason is, as stated in the last line of the question:

    The data I am actually using is created by Confluent.

    It turns out that on output, the Confluent HDFS sink converts enums to strings. Since I had created the external table in Hive from my original schema, there was a discrepancy. If I instead extract the schema from a file created by the HDFS sink and use that one in the table definition, everything works as expected.
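    Concretely, the fix can be sketched as follows (the jar version, file names, and the ALTER statement are illustrative). First, pull the writer schema out of a file the sink actually produced:

    java -jar avro-tools-1.11.3.jar getschema part-0000.avro > event_sink.avsc

    In the extracted schema, the enum field comes back as a plain string:

    { "name": "upsert", "type": "string" }

    The table can then be pointed at the corrected schema without being recreated:

    ALTER TABLE events.event
    SET SERDEPROPERTIES ('avro.schema.url'='file:///path/to/event_sink.avsc');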