apache-spark  apache-kafka  apache-kafka-connect  spark-structured-streaming

When is a Kafka connector preferred over a Spark streaming solution?


With Spark streaming, I can read Kafka messages and write the data to different kinds of tables, for example HBase, Hive and Kudu. But this can also be done by using Kafka connectors for these tables. My question is: in which situations should I prefer connectors over the Spark streaming solution?

Also, how fault tolerant is the Kafka connector solution? We know that with Spark streaming we can use checkpoints and executors running on multiple nodes for fault-tolerant execution, but how is fault tolerance (if possible) achieved with Kafka connectors? By running the connector on multiple nodes?


Solution

  • In which situations should I prefer connectors over the Spark streaming solution?

    "It Depends" :-)

    1. Kafka Connect is part of Apache Kafka, and so has tighter integration with Apache Kafka in terms of security, delivery semantics, etc.
    2. If you don't want to write any code, Kafka Connect is easier because it's just JSON to configure and run.
    3. If you're not using Spark already, Kafka Connect is arguably more straightforward to deploy (run the JVM, pass in the configuration).
    4. As a framework, Kafka Connect is more transferable: the concepts stay the same, and you just plug in the appropriate connector for each technology you want to integrate with.
    5. Kafka Connect handles the tricky stuff for you, such as schemas, offsets, restarts and scale-out.
    6. Kafka Connect supports Single Message Transforms (SMTs) for making changes to data as it passes through the pipeline (masking fields, dropping fields, changing data types, etc.). For more advanced processing you would use something like Kafka Streams or ksqlDB.
    7. If you are using Spark, and it's working just fine, then it's not necessarily prudent to rip it up to use Kafka Connect instead :)
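
To make points 2 and 6 a bit more concrete, here is a minimal sketch of what "just JSON" can look like, built in Python so the payload is easy to inspect. Everything specific here is an assumption for illustration: the connector name, the topic, the field names, the Elasticsearch sink as the target, and the localhost URLs. The exact properties a connector needs depend on the connector and its version, so treat this as a shape to adapt rather than a working configuration for your cluster.

```python
import json
import urllib.request


def build_connector_payload(name, topic, masked_fields):
    """Build the JSON body that the Connect REST API expects on POST /connectors."""
    config = {
        # Sink connector class (assumed: the Elasticsearch sink plugin is installed).
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "tasks.max": "2",
        "topics": topic,
        # Hypothetical target address.
        "connection.url": "http://localhost:9200",
        # One Single Message Transform, aliased "mask", that masks fields in the record value.
        "transforms": "mask",
        "transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
        "transforms.mask.fields": ",".join(masked_fields),
    }
    return {"name": name, "config": config}


def submit_connector(payload, connect_url="http://localhost:8083"):
    """POST the payload to a Connect worker (the URL is a hypothetical local worker)."""
    request = urllib.request.Request(
        f"{connect_url}/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    # Hypothetical connector name, topic and fields, printed rather than submitted.
    payload = build_connector_payload("orders-sink", "orders", ["card_number", "email"])
    print(json.dumps(payload, indent=2))
```

The printed JSON is the whole "program": there is no consumer loop, offset handling or retry logic to write, which is the main contrast with a Spark streaming job.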

  • Also, how fault tolerant is the Kafka connector solution? … how is fault tolerance (if possible) achieved with Kafka connectors?

    1. Kafka Connect can be run in distributed mode, in which you have one or more worker processes across nodes. If a worker fails, Kafka Connect rebalances the tasks across the remaining ones. If you add a worker, Kafka Connect will rebalance to ensure workload distribution. This was drastically improved in Apache Kafka 2.3 (KIP-415).
    2. Kafka Connect uses the Kafka consumer API and tracks offsets of records delivered to a target system in Kafka itself. If the task or worker fails you can be sure that it will restart from the correct point. Many connectors support exactly-once delivery too (e.g. HDFS, Elasticsearch, etc.).
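
One way to see the distribution described above is the status endpoint of the Connect REST API, which reports the worker each task is currently assigned to. The sketch below reads such a status document and summarises it. The worker URL, the connector name and the sample status are made up for illustration; only the overall shape of the response (a `tasks` list with `id`, `state` and `worker_id`) is taken from the REST API.

```python
import json
import urllib.request
from collections import Counter


def fetch_status(name, connect_url="http://localhost:8083"):
    """GET /connectors/{name}/status from a Connect worker (hypothetical local URL)."""
    with urllib.request.urlopen(f"{connect_url}/connectors/{name}/status") as response:
        return json.loads(response.read())


def tasks_per_worker(status):
    """Count how many tasks of this connector each worker is running."""
    return Counter(task["worker_id"] for task in status["tasks"])


def failed_task_ids(status):
    """Return the ids of tasks that are in the FAILED state."""
    return [task["id"] for task in status["tasks"] if task["state"] == "FAILED"]


# Made-up sample response: three tasks spread over two workers, one task failed.
SAMPLE_STATUS = {
    "name": "orders-sink",
    "connector": {"state": "RUNNING", "worker_id": "10.0.0.1:8083"},
    "tasks": [
        {"id": 0, "state": "RUNNING", "worker_id": "10.0.0.1:8083"},
        {"id": 1, "state": "RUNNING", "worker_id": "10.0.0.2:8083"},
        {"id": 2, "state": "FAILED", "worker_id": "10.0.0.1:8083"},
    ],
}

if __name__ == "__main__":
    print(dict(tasks_per_worker(SAMPLE_STATUS)))
    print(failed_task_ids(SAMPLE_STATUS))
```

If a worker process dies, polling this endpoint should show its tasks reappearing under the remaining workers after the rebalance. A task that is reported as FAILED because of an exception is a different case: as far as I know it generally stays failed until it is restarted, for example through the REST API's task restart endpoint.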

    If you want to learn more about Kafka Connect see the docs here and my talk here. See a list of connectors here, and tutorial videos here.


    Disclaimer: I work for Confluent and am a big fan of Kafka Connect :-)