
Error creating hive table from avro schema


I am trying to create a Hive table by extracting the schema from Avro data stored in S3. The data is written to S3 by the Kafka Connect S3 sink connector. I am publishing a simple POJO to the producer.

Code for extracting the schema from the Avro data:-

for filename in os.listdir(temp_folder_path):
    # os.path.join avoids a missing separator if temp_folder_path lacks a trailing slash
    filename = os.path.join(temp_folder_path, filename)
    if filename.endswith('avro'):
        os.system(
            'java -jar /path/to/avro-jar/avro-tools-1.8.2.jar getschema {0} > {1}'.format(
                filename, filename.replace('avro', 'avsc')))

The extracted schema is then saved in an s3 bucket.
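Before pointing Hive at an extracted `.avsc`, it is worth verifying that its top-level type is actually `record`, since a bytes schema is exactly what produces the error below. A minimal standard-library sketch (the helper name is mine, not part of the question's code):

```python
import json

def is_record_schema(avsc_text):
    """True if the Avro schema's top-level type is 'record',
    which is what Hive's AvroSerDe requires."""
    schema = json.loads(avsc_text)
    # An Avro schema may be a bare string like "bytes"; only a
    # JSON object with "type": "record" satisfies the SerDe.
    return isinstance(schema, dict) and schema.get("type") == "record"

print(is_record_schema('{"type": "record", "name": "Employee", "fields": []}'))  # True
print(is_record_schema('"bytes"'))  # False -- reproduces the failure mode
```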

Create table query:-

CREATE EXTERNAL TABLE IF NOT EXISTS `db_name_service.table_name_change_log`
PARTITIONED BY (`year` bigint, `month` bigint, `day` bigint, `hour` bigint)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://bucket/topics/topic_name'
TBLPROPERTIES ('avro.schema.url'='s3://bucket/schemas/topic_name.avsc')

Error:-

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.avro.AvroSerdeException Schema for table must be of type RECORD. Received type: BYTES)

Schema:-

{
  "type": "record",
  "name": "Employee",
  "doc": "Represents an Employee at a company",
  "fields": [
    {"name": "firstName", "type": "string", "doc": "The persons given name"},
    {"name": "nickName", "type": ["null", "string"], "default": null},
    {"name": "lastName", "type": "string"},
    {"name": "age", "type": "int", "default": -1},
    {"name": "phoneNumber", "type": "string"}
  ]
}

I can see data in the topic using this command:

./confluent-4.1.1/bin/kafka-avro-console-consumer --topic test2_singular --bootstrap-server localhost:9092 --from-beginning

{"firstName":"A:0","nickName":{"string":"C"},"lastName":"C","age":0,"phoneNumber":"123"}

{"firstName":"A:1","nickName":{"string":"C"},"lastName":"C","age":1,"phoneNumber":"123"}

Solution

  • Schema for table must be of type RECORD. Received type: BYTES

    This error can only happen if you aren't using AvroConverter in the Connect sink configuration: without it, the connector writes the raw message bytes, so the schema Hive sees is `bytes` rather than a record.
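    For reference, a sink configuration along these lines writes genuine Avro container files (the connector name, bucket, and Schema Registry URL are placeholders for your own values; the class names are the standard Confluent ones):

    ```json
    {
      "name": "s3-sink",
      "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "test2_singular",
        "s3.bucket.name": "bucket",
        "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter.schema.registry.url": "http://localhost:8081"
      }
    }
    ```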

    Once the sink is writing real Avro container files, you'll also want to extract the schema from one of the files in S3, so that it matches what Hive will actually read.

    Tip: an AWS Lambda function that watches for `.avro` object creations in the bucket can extract schemas without scanning the entire bucket (or picking random files), and can also notify Hive/AWS Glue of table schema updates.
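A sketch of the event-handling side of such a Lambda, assuming the connector's default `topics/<topic>/...` key layout from the question; the actual schema extraction (avro-tools or the `avro` package) and the Glue/Hive update calls are omitted:

```python
def handler(event, context=None):
    """For each newly created .avro object in an S3 event notification,
    compute the key where its extracted .avsc schema should be stored."""
    schema_keys = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        if not key.endswith(".avro"):
            continue  # ignore non-Avro objects
        # topics/topic_name/year=.../file.avro -> schemas/topic_name.avsc
        topic = key.split("/")[1]
        schema_keys.append((bucket, "schemas/{}.avsc".format(topic)))
    return schema_keys

sample_event = {"Records": [{"s3": {
    "bucket": {"name": "bucket"},
    "object": {"key": "topics/topic_name/year=2018/part-0000.avro"},
}}]}
print(handler(sample_event))  # [('bucket', 'schemas/topic_name.avsc')]
```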