Search code examples
apache-kafkaapache-flink

Is it better to do avro deserialization in kafka connector's DeserializationSchema or after in process function


So I have a use case where I have a kafka connector consuming an avro byte array from a kafka topic and converts it to an Avro object. Seems straightforward enough, but I realized that if the deserialization fails for some reason, like not matching the schema or something, that the only options for handling that are either log an error and output an empty byte array or throw an error (which I don't see as a good idea for a long running job).

But if the kafka connector's deserializer just takes in the byte array, outputs it, and downstream process function does the validation and conversion, then if an error occurs it could write the error as an "error message" pojo to a side output to then be written to an error kafka topic that would make tracking what messages failed and the relevant data much easier.

Is there a way to already do this in the kafka connector's serialization logic or would this have some serious performance issue (like is the kafka connector's serialization logic optimized to do these conversions magnitudes faster than just doing it in a downstream function)?

Thanks for any input in advance!


Solution

  • No, there should not be a significant performance difference and doing the serialization downstream is certainly more flexible. For example, you could also run the serialization at a higher parallelism than the source, which might make sense if serialization is quite expensive in your case.

    The only downside I see right now, is that you can not use per-partition watermarking [1]. There has also been a discussion on the dev mailing list recently related to these topic [2].

    Hope this helps.

    [1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/connectors/kafka.html#kafka-consumers-and-timestamp-extractionwatermark-emission [2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Connectors-and-NULL-handling-td29695.html