Tags: apache-kafka, avro, apache-kafka-connect, s3-kafka-connector

Kafka & Connect - how to fix AVRO Schema Data type


Setup

Multiple independent source systems push Avro events into a Kafka topic. A Kafka Connect S3 sink connector reads the Avro events from this topic and writes them to S3 in Parquet format.
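
For reference, a typical configuration for such a sink might look like the following, assuming Confluent's S3 sink connector (the topic name, bucket, region, and registry URL are placeholders):

```
name=s3-parquet-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
topics=raw-events
s3.bucket.name=my-bucket
s3.region=eu-west-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.parquet.ParquetFormat
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
flush.size=1000
```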

Problem

The Avro schemas in our schema registry are not up to standard. E.g., a decimal field from the source system has base type string and logical type decimal in the schema registry. That combination is not allowed in Avro (the decimal logical type must always have base type fixed or bytes).
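
For illustration, the registry currently holds something like the first field definition below, while the Avro spec requires the second (the field name, precision, and scale are made up):

```
{"name": "amount", "type": {"type": "string", "logicalType": "decimal"}}

{"name": "amount", "type": {"type": "bytes", "logicalType": "decimal", "precision": 10, "scale": 2}}
```

Because the logical type in the first form is invalid, Avro implementations silently ignore it and fall back to the base string type, which is what the Parquet writer then sees.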

These incorrect Avro schemas result in incorrect Parquet file schemas. E.g., in Parquet, the decimal field ends up with type string and loses all details of its decimal format (precision and scale).

Question

What's the best way to get correct Avro types into the schema registry? We cannot update the source systems to send correct types.

Should we write an SMT with custom logic to handle the logical types, e.g., by searching for the decimal logical type and changing the base type and value accordingly? Or should we use Kafka Streams or a custom serializer/deserializer instead of an SMT? What other options are available?


Solution

  • SMTs can be used, but you will need to write your own, since it seems you want to recursively modify the entire record (see the SMT sketch after this list).

    Kafka Streams, on the other hand, gives you more control over the data, potentially at the expense of a "duplicated" topic holding the format you want to write (see the Streams sketch below).
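
A minimal sketch of such a custom SMT, assuming the Avro converter surfaces the broken fields as Connect STRING fields and keeps the original logical-type hint as a schema parameter; the class and package names are hypothetical, the scale is hard-coded, and only flat structs are handled (recursion into nested structs is omitted for brevity):

```java
package com.example.kafka.smt; // hypothetical package

import java.math.BigDecimal;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Decimal;
import org.apache.kafka.connect.data.Field;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.transforms.Transformation;

// Rewrites string fields that carry decimal values into the Connect
// Decimal logical type (base type BYTES), so the S3 sink can write a
// proper Parquet DECIMAL column.
public class StringToDecimal<R extends ConnectRecord<R>> implements Transformation<R> {

    private static final int SCALE = 2; // assumption: fixed scale; make configurable in real use

    @Override
    public R apply(R record) {
        if (record.valueSchema() == null || record.valueSchema().type() != Schema.Type.STRUCT) {
            return record; // schemaless or non-struct records pass through unchanged
        }
        Schema newSchema = convertSchema(record.valueSchema());
        Struct newValue = convertStruct((Struct) record.value(), newSchema);
        return record.newRecord(record.topic(), record.kafkaPartition(),
                record.keySchema(), record.key(), newSchema, newValue, record.timestamp());
    }

    // Rebuild the schema, swapping mislabeled string fields to Decimal(SCALE).
    private Schema convertSchema(Schema schema) {
        SchemaBuilder builder = SchemaBuilder.struct().name(schema.name());
        for (Field field : schema.fields()) {
            if (isMislabeledDecimal(field.schema())) {
                builder.field(field.name(), Decimal.builder(SCALE).optional().build());
            } else {
                builder.field(field.name(), field.schema());
            }
        }
        return builder.build();
    }

    // Copy values, parsing strings into BigDecimal where the schema changed.
    private Struct convertStruct(Struct value, Schema newSchema) {
        Struct result = new Struct(newSchema);
        for (Field field : newSchema.fields()) {
            Object v = value.get(field.name());
            if (v instanceof String && Decimal.LOGICAL_NAME.equals(field.schema().name())) {
                result.put(field.name(), new BigDecimal((String) v).setScale(SCALE));
            } else {
                result.put(field.name(), v);
            }
        }
        return result;
    }

    // Assumption: the broken schemas keep the logical-type hint as a schema
    // parameter; adjust this check to however your registry marks the fields.
    private boolean isMislabeledDecimal(Schema schema) {
        return schema.type() == Schema.Type.STRING
                && schema.parameters() != null
                && "decimal".equals(schema.parameters().get("logicalType"));
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public void close() {
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }
}
```

The transform would then be wired into the sink connector config with `transforms=fixDecimals` and `transforms.fixDecimals.type=com.example.kafka.smt.StringToDecimal`.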
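
Alternatively, a Kafka Streams job can re-publish the data to a clean topic with a corrected schema, and the S3 sink consumes the clean topic instead. A sketch using Confluent's `GenericAvroSerde`; the topic names, record schema, field names, and fixed scale of 2 are all assumptions:

```java
package com.example.streams; // hypothetical package

import java.math.BigDecimal;
import java.math.RoundingMode;
import java.nio.ByteBuffer;
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import io.confluent.kafka.streams.serdes.avro.GenericAvroSerde;

public class DecimalFixerApp {

    // Corrected schema: the decimal is carried as bytes + logical type, per the Avro spec.
    private static final Schema FIXED_SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Payment\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"},"
          + "{\"name\":\"amount\",\"type\":{\"type\":\"bytes\","
          + "\"logicalType\":\"decimal\",\"precision\":10,\"scale\":2}}]}");

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, GenericRecord> raw = builder.stream("raw-events");

        raw.mapValues(rec -> {
            GenericRecord fixed = new GenericData.Record(FIXED_SCHEMA);
            fixed.put("id", rec.get("id").toString());
            // Avro encodes a decimal as the unscaled value's two's-complement
            // bytes; the scale must match the schema, hence setScale(2, ...).
            BigDecimal amount = new BigDecimal(rec.get("amount").toString())
                    .setScale(2, RoundingMode.HALF_UP);
            fixed.put("amount", ByteBuffer.wrap(amount.unscaledValue().toByteArray()));
            return fixed;
        }).to("clean-events"); // point the S3 sink at this topic instead

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "decimal-fixer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, GenericAvroSerde.class);
        props.put("schema.registry.url", "http://localhost:8081");

        new KafkaStreams(builder.build(), props).start();
    }
}
```

One design consequence of this approach: the original subject in the schema registry stays untouched, and the serde registers the corrected schema under the clean topic's own subject (e.g., `clean-events-value`), so downstream consumers only ever see valid Avro.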