Tags: apache-kafka, avro, data-conversion

How to approach converting Kafka messages to and from other systems?


I would like to have a series of small stand-alone services that would either consume a Kafka topic and output the data into a different system, or do the reverse: receive data from a system and produce Kafka messages. What would be the right approach for such an application?

Example 1: Converting a SQL query result into a stream of Kafka messages

Let's take, for example, the DB -> Kafka case. Ideally the service would be configured with an Avro schema and a SQL query, along with the connection configuration (URL, credentials, topic, consumer group, etc.).

schema.avsc:

{
   "type" : "record",
   "namespace" : "BookExample",
   "name" : "book",
   "fields" : [
      { "name" : "title" , "type" : "string" },
      { "name" : "year" , "type" : "int" }
   ]
}

query.sql:

SELECT title, year FROM books;

And once started, the application would execute the query and pipe the results into Kafka.

Now, since both the input and the output are defined by configuration, the system cannot be type-checked at compile time. It would have to throw some form of error (a parsing error?) at runtime, once it actually tries to run.

Note also that the application is under-specified. For simplicity, it would map columns directly to fields of the same name (since field order is not guaranteed in an Avro schema). That is fine. A more complex application could take a map of {columnName: fieldName}, but that is not important for this example.
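
To make the idea concrete, here is a minimal sketch (not a full implementation) of what this direction could look like with Avro's GenericRecord API, which builds records against a schema parsed at runtime rather than from generated classes. The broker address, JDBC URL, credentials and topic name are placeholders, and the plain Avro binary encoding is used instead of a schema registry.

DbToKafka.scala (sketch):

import java.io.{ByteArrayOutputStream, File}
import java.sql.DriverManager
import java.util.Properties

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

import scala.jdk.CollectionConverters._

object DbToKafka extends App {
  // The schema and the query are read at runtime; nothing about them is known at compile time.
  val schema = new Schema.Parser().parse(new File("schema.avsc"))
  val query  = scala.io.Source.fromFile("query.sql").mkString

  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // placeholder
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
  val producer = new KafkaProducer[String, Array[Byte]](props)

  // Placeholder JDBC connection details.
  val conn = DriverManager.getConnection("jdbc:postgresql://localhost/books", "user", "pass")
  val rs   = conn.createStatement().executeQuery(query)

  while (rs.next()) {
    // Map result-set columns to Avro fields of the same name.
    val record: GenericRecord = new GenericData.Record(schema)
    schema.getFields.asScala.foreach(f => record.put(f.name(), rs.getObject(f.name())))

    // Serialize with the plain Avro binary encoding (no schema registry involved).
    val out     = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
    encoder.flush()

    producer.send(new ProducerRecord[String, Array[Byte]]("books", out.toByteArray))
  }
  producer.close()
  conn.close()
}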

Example 2: Persisting Kafka messages into a DB table

The same as Example 1, but in reverse. Now a SQL query is not even needed; only a table name is required as configuration (assuming the same column-to-field convention as before).

The application would consume a Kafka topic with the Avro schema above and write each message as a row in the target table.
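
A corresponding sketch of the reverse direction, under the same assumptions as above (placeholder connection details, plain Avro binary encoding). The INSERT statement is derived from the schema's field names, so nothing about the table is fixed at compile time.

KafkaToDb.scala (sketch):

import java.io.File
import java.sql.DriverManager
import java.time.Duration
import java.util.Properties

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import org.apache.kafka.clients.consumer.KafkaConsumer

import scala.jdk.CollectionConverters._

object KafkaToDb extends App {
  val schema = new Schema.Parser().parse(new File("schema.avsc"))
  val table  = "books" // the only extra piece of configuration

  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // placeholder
  props.put("group.id", "kafka-to-db")             // placeholder
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
  val consumer = new KafkaConsumer[String, Array[Byte]](props)
  consumer.subscribe(java.util.Collections.singletonList(table))

  // Placeholder JDBC connection details.
  val conn = DriverManager.getConnection("jdbc:postgresql://localhost/books", "user", "pass")

  // Build "INSERT INTO books (title, year) VALUES (?, ?)" from the schema itself.
  val fields = schema.getFields.asScala.map(_.name()).toList
  val insert = s"INSERT INTO $table (${fields.mkString(", ")}) VALUES (${fields.map(_ => "?").mkString(", ")})"
  val reader = new GenericDatumReader[GenericRecord](schema)

  while (true) {
    for (msg <- consumer.poll(Duration.ofSeconds(1)).asScala) {
      val decoder = DecoderFactory.get().binaryDecoder(msg.value(), null)
      val record  = reader.read(null, decoder)
      val stmt    = conn.prepareStatement(insert)
      fields.zipWithIndex.foreach { case (name, i) =>
        val value = record.get(name) match {
          case s: CharSequence => s.toString // Avro strings arrive as Utf8, not java.lang.String
          case other           => other
        }
        stmt.setObject(i + 1, value)
      }
      stmt.executeUpdate()
      stmt.close()
    }
  }
}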

Example 3: HTTP

The same could be done for a web service that receives a JSON payload and publishes it to the configured Kafka topic. A 400 status code could be returned if the payload does not conform to the Avro schema.
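
The validation step could lean on Avro's JSON decoder: if decoding the payload against the schema fails, the handler answers 400. A minimal sketch, with the HTTP framework itself omitted and the helper name purely illustrative:

PayloadValidation.scala (sketch):

import java.io.File

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

object PayloadValidation {
  private val schema = new Schema.Parser().parse(new File("schema.avsc"))
  private val reader = new GenericDatumReader[GenericRecord](schema)

  /** Right(record) if the JSON payload conforms to the Avro schema, Left(error) otherwise (answer 400). */
  def parse(json: String): Either[String, GenericRecord] =
    try {
      val decoder = DecoderFactory.get().jsonDecoder(schema, json)
      Right(reader.read(null, decoder))
    } catch {
      case e: Exception => Left(e.getMessage)
    }
}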

What have I done?

I have implemented small applications like the ones above in Scala, but the problem is that I could not make them fully generic: both the Avro schema and the table definition are needed at compile time to create the objects, which makes the applications rather inflexible.

What should I do?

My first thought was to implement it in an untyped or dynamically typed language (Python?), but that started to look a lot like parsing text input and generating code (in the SQL case). That is why I am asking this question: I am not sure this is the right approach. It seems to me that such an application is a kind of compiler/interpreter, transforming one kind of data (text?) into another.

Available tools

I know about Kafka Connect (though I have never used it), and it seems to me to be something very tightly coupled to the Kafka broker. I was wondering whether there could be a lightweight application, easily deployed and transparent to the Kafka broker. For the broker, the application would look like a normal consumer or producer.

Related SO question

I saw this question, but its solution involves changing the receiving application to deserialize the Kafka messages. I want an application in the middle, so that the two ends (Kafka and the other system) don't know about each other.


Solution

  • DB -> Kafka. Ideally the service would be configured with an Avro schema and a SQL query along with the connection configuration

    Debezium does this, sure.

Map columns directly to fields of the same name

    That's exactly what it does.

    Example 1 but in reverse

    This is what Kafka Connect sinks do, yes. (Please try it; a sample sink connector configuration is sketched after this list.)

    web service that receives a JSON payload and publishes it to the configured Kafka topic

    Confluent has a Kafka REST Proxy for this, but there are other solutions as well, such as the Strimzi Bridge, Karapace, etc.

    something very tightly coupled to the Kafka broker

    You've already written a Kafka client... That's the same level of coupling as Connect, which is itself just another client of the broker. And you should not run Connect on the broker servers anyway.

    For the broker the application would look like a normal consumer or producer.

    That's what Connect (and Debezium, which builds on top of it) does, yes.

    I want an application in the middle

    Kafka Streams? Apache Spark, Flink, and NiFi all support databases and Avro as well... Overall, you need a stream-processing framework here.
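
For reference, here is roughly what a Kafka Connect JDBC sink for Example 2 could look like: it is configuration rather than code, POSTed to the Connect REST API. This is only a sketch; it assumes Confluent's kafka-connect-jdbc plugin plus the Avro converter with a Schema Registry, and all hosts, credentials and names are placeholders.

books-jdbc-sink.json (sketch):

{
  "name": "books-jdbc-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "books",
    "connection.url": "jdbc:postgresql://localhost/books",
    "connection.user": "user",
    "connection.password": "pass",
    "insert.mode": "insert",
    "auto.create": "true",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://localhost:8081"
  }
}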