mongodb apache-kafka changestream

MongoDB ChangeStream vs. Apache Kafka


I want to continue processing data from MongoDB once I have confirmed that my upsert writes to it succeeded. I have two options to accomplish this:

  • Write to Kafka after the writes to MongoDB were successful (from the same job that wrote to Mongo)
  • Receive the events of the written documents through Mongo ChangeStream, and continue processing them from there
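For reference, a minimal sketch of the first option, assuming pymongo's `update_one(..., upsert=True)` and a hypothetical `publish(key, value)` callable that wraps whatever Kafka producer the job uses. The `UpdateResult` returned by pymongo already shows whether an upsert inserted or updated, so this route can carry the operation type too. The function name and event shape are made up for illustration.

```python
from typing import Any, Callable, Dict


def upsert_then_publish(
    collection: Any,
    publish: Callable[[Any, Dict[str, Any]], None],
    doc_id: Any,
    fields: Dict[str, Any],
) -> str:
    """Upsert one document, then publish an event describing what happened.

    `collection` is expected to behave like a pymongo Collection.
    `publish` is a hypothetical wrapper around a Kafka producer's send call.
    """
    # With an acknowledged write concern, pymongo raises if the write fails,
    # so reaching the next line means MongoDB accepted the upsert.
    result = collection.update_one({"_id": doc_id}, {"$set": fields}, upsert=True)

    # upserted_id is set only when the upsert created a new document.
    operation = "insert" if result.upserted_id is not None else "update"

    # If the job dies right here, MongoDB has the write but Kafka never
    # hears about it. That window is the weak point of this option.
    publish(doc_id, {"_id": doc_id, "operation": operation})
    return operation
```

The comment before `publish` marks the gap between the two writes: nothing makes the MongoDB write and the Kafka send atomic.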

As I understand it, Kafka's advantages are that it is distributed and lets more than one consumer instance read the data (I understand ChangeStream doesn't make this easy). The advantage I see in ChangeStream is that it tells me the type of each operation (I perform upserts, so it lets me know whether each one was an insert or an update). I'm not asking which is better, because they clearly serve different use cases. But are there any other features or disadvantages of either option that I'm missing?
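To make the ChangeStream side concrete, here is a rough sketch using pymongo's `Collection.watch()`. Each change event carries an `operationType` field, which is what separates an upsert that inserted from one that updated. The function names, the URI, and the database and collection names are placeholders, and change streams need a replica set or sharded cluster.

```python
from typing import Any, Iterator, Mapping, Tuple


def classify_upsert(event: Mapping[str, Any]) -> str:
    """Map a change event to 'inserted', 'updated', or 'other'."""
    op = event.get("operationType")
    if op == "insert":
        return "inserted"
    # An upsert that matched an existing document arrives as "update",
    # or as "replace" when it was issued through replace_one.
    if op in ("update", "replace"):
        return "updated"
    return "other"


def watch_upserts(uri: str, db: str, coll: str) -> Iterator[Tuple[str, Any]]:
    """Yield (classification, document _id) for each write on the collection."""
    # Imported here so classify_upsert can be used without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    # Server-side filter: only the operation types an upsert can produce.
    pipeline = [
        {"$match": {"operationType": {"$in": ["insert", "update", "replace"]}}}
    ]
    with client[db][coll].watch(pipeline) as stream:
        for event in stream:
            yield classify_upsert(event), event["documentKey"]["_id"]
```

Something like `for kind, doc_id in watch_upserts("mongodb://localhost:27017", "mydb", "mycoll"): ...` would then drive the downstream processing.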

I also understand that both let the client pick up where it left off after its reads have failed for a while (Kafka within its retention period, ChangeStream with the resume token).
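A sketch of the resume-token side, assuming the token is stored in a local JSON file (the file path and helper names are invented here, and a real job might keep the token in MongoDB or elsewhere). In pymongo the `_id` of each change event is its resume token, and `watch(resume_after=...)` restarts the stream just after it. Resuming only works while the corresponding entry is still in the oplog.

```python
import json
import os
from typing import Any, Callable, Mapping, Optional


def load_token(path: str) -> Optional[dict]:
    """Return the last saved resume token, or None on the first run."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return None


def save_token(path: str, token: Mapping[str, Any]) -> None:
    """Write the token to a temp file, then rename it over the old one."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(dict(token), f)
    os.replace(tmp, path)


def process_with_resume(
    collection: Any,
    token_path: str,
    handle: Callable[[Mapping[str, Any]], None],
) -> None:
    """Consume a change stream, resuming after the last handled event."""
    token = load_token(token_path)
    # resume_after=None simply starts from the current point in time.
    with collection.watch(resume_after=token) as stream:
        for event in stream:
            handle(event)
            # Saved only after handling: a crash in between means the event
            # is delivered again on restart (at-least-once processing).
            save_token(token_path, event["_id"])
```

Because the token is written after `handle` returns, the handler should tolerate seeing the same event twice.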


Solution

  • Writing to two systems from one job (a dual write, which is not a true two-phase commit) can cause inconsistencies; you should write to only one location, whichever you think is more highly available.

    If you already have Kafka, you can write to Mongo and then use tooling such as Debezium to stream the changes from the oplog into Kafka (including the operation type, for example). This is change data capture, and it is closely related to the "outbox pattern".

    Or you can write to Kafka and use the MongoDB sink connector to send data to the database.
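Both alternatives come down to registering a connector with Kafka Connect. The sketch below builds the two registration payloads and posts one to the Connect REST API (`POST /connectors`). The connector class names are the ones published by Debezium and MongoDB, but the property names shown are an assumption (they match recent connector versions and have changed between releases, so check them against the version in use), and every host, name, and topic is a placeholder.

```python
import json
import urllib.request
from typing import Any, Dict


def debezium_mongo_source(
    name: str, connection_string: str, topic_prefix: str, collections: str
) -> Dict[str, Any]:
    """Payload for Mongo -> Kafka via Debezium (property names assumed)."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
            "mongodb.connection.string": connection_string,
            "topic.prefix": topic_prefix,
            # Comma-separated "database.collection" names to capture.
            "collection.include.list": collections,
        },
    }


def mongo_sink(
    name: str, uri: str, database: str, collection: str, topics: str
) -> Dict[str, Any]:
    """Payload for Kafka -> Mongo via the MongoDB sink connector."""
    return {
        "name": name,
        "config": {
            "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
            "connection.uri": uri,
            "database": database,
            "collection": collection,
            "topics": topics,
        },
    }


def register(connect_url: str, connector: Dict[str, Any]) -> int:
    """POST the payload to Kafka Connect and return the HTTP status code."""
    request = urllib.request.Request(
        connect_url.rstrip("/") + "/connectors",
        data=json.dumps(connector).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

For example, `register("http://localhost:8083", mongo_sink("orders-sink", "mongodb://localhost:27017", "shop", "orders", "orders"))` would set up the write-to-Kafka-first route. Either way the application writes to one system and the connector handles the other.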