
Can the Kafka Connect MongoDB source connector run as a cluster (tasks.max > 1)?


I'm using the MongoDB source connector (mongo-source) for Kafka Connect. One of its documented configuration options is tasks.max.

This means I can set tasks.max to a value greater than 1, but I don't understand what that does behind the scenes.

If it creates multiple instances that each listen to the MongoDB change stream, I will end up with duplicate messages. So does mongo-source really support parallelism and work as a cluster? What does it do when tasks.max is greater than 1?


Solution

  • Mongo-source doesn't support tasks.max > 1. Even if you set it to a value greater than 1, only one task will pull data from MongoDB into Kafka.

    How many tasks are created depends on the particular connector. The method List<Map<String, String>> Connector::taskConfigs(int maxTasks), which you override when implementing a connector, returns a list whose size determines the number of tasks. If you check the mongo-kafka source connector, you will see that it returns a singletonList.

    Here is a permalink to the version that is current at the time of writing (1.13.0). You can check the main branch to see whether it still returns a singletonList: https://github.com/mongodb/mongo-kafka/blob/8c1a9b2bec644477507a898789dded2b3798b2d3/src/main/java/com/mongodb/kafka/connect/MongoSourceConnector.java#L91
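To make the effect of the list size concrete, here is a minimal sketch that does not depend on the Kafka Connect jars. The classes SingleTaskSketch, PartitionedSketch and TaskConfigsDemo are hypothetical stand-ins, not the real connector code; they only mimic the shape of taskConfigs(int maxTasks). The first one ignores maxTasks and returns a singletonList, as the answer describes for mongo-source. The second one is an invented contrast showing how a connector that can split its work might return several task configs. The URI value is a placeholder.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a connector that always runs exactly one task.
class SingleTaskSketch {
    private final Map<String, String> settings;

    SingleTaskSketch(Map<String, String> settings) {
        this.settings = settings;
    }

    // Same shape as Connector::taskConfigs(int maxTasks): one list entry per task.
    // maxTasks is ignored, so the framework only ever gets one task config.
    List<Map<String, String>> taskConfigs(int maxTasks) {
        return Collections.singletonList(settings);
    }
}

// Hypothetical contrast: a connector whose input can be split across tasks.
class PartitionedSketch {
    private final List<String> partitions;

    PartitionedSketch(List<String> partitions) {
        this.partitions = partitions;
    }

    // Returns at most maxTasks configs and never more than there are partitions.
    List<Map<String, String>> taskConfigs(int maxTasks) {
        int taskCount = Math.min(maxTasks, partitions.size());
        List<Map<String, String>> configs = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            // Round-robin assignment: task i gets partitions i, i + taskCount, ...
            List<String> assigned = new ArrayList<>();
            for (int j = i; j < partitions.size(); j += taskCount) {
                assigned.add(partitions.get(j));
            }
            Map<String, String> config = new HashMap<>();
            config.put("partitions", String.join(",", assigned));
            configs.add(config);
        }
        return configs;
    }
}

class TaskConfigsDemo {
    public static void main(String[] args) {
        Map<String, String> settings = new HashMap<>();
        settings.put("connection.uri", "mongodb://localhost:27017"); // placeholder value

        SingleTaskSketch single = new SingleTaskSketch(settings);
        PartitionedSketch split = new PartitionedSketch(Arrays.asList("a", "b", "c"));

        System.out.println(single.taskConfigs(4).size()); // prints 1
        System.out.println(split.taskConfigs(4).size());  // prints 3
        System.out.println(split.taskConfigs(2).size());  // prints 2
    }
}
```

In the first case, asking for four tasks still yields a single config, which is why raising tasks.max has no effect on a connector written this way.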