Tags: python, apache-kafka, redis, producer-consumer

Best data structure to maintain a table from a stream of Kafka update messages in Python


Suppose I have a tabular dataset of fixed dimension (N x M). I receive a stream of updates from Kafka updating entries in this table. Ultimately, I'd like to have a pandas dataframe with a recent version of the table, and I'm considering a few options for doing that:

  1. Maintain it in memory as a table / dataframe. My concern here is that I don't know if I can avoid multithreading, since one process will perpetually be in a for loop receiving messages.

  2. Maintain it in an external structure, and have a separate process independently read from it. Choices of external data stores: a) SQLite - Might have concurrency issues, and updates for arbitrary rows are probably a bit messy. b) Redis - Easy to maintain, but hard to query / read the whole table at once (which is how I would generally be accessing the data).

I'm a bit of a Kafka beginner, so any advice here would be appreciated. How would you approach this problem? Thanks!

EDIT: I guess I could also just maintain it in memory and then just push the whole thing to SQLite?


Solution

  • My initial approach would be to ask: can I create a "good enough" solution to start with, and optimize it later if needed?

    Unless you need to worry about very sensitive information (like healthcare or finance data), or data that is definitely going to scale up very quickly, I would suggest trying a simple solution first and then seeing if you hit any problems. You may not!

    Ultimately, I would probably go with the SQLite solution to start with, as it's relatively simple to set up and it's a good fit for the use case (i.e. "transactional" situations).

    Here are some considerations I would think about:

    Pros/cons of a single process

    Unless your data is high-velocity / high-volume, your suggestion of consuming and processing the data in the same process is probably fine. Processing data locally is much faster than receiving it over the network (assuming your Kafka feed isn't on your local computer), so your data ingest from Kafka would probably be the bottleneck.

    But it could get expensive to have a Python process spinning indefinitely, and you would need to persist your data to a file or database so it isn't lost if your process shuts down.
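
    To make this concrete, here is a minimal sketch of the single-process approach. It assumes the kafka-python client, a topic named "table-updates", and JSON messages shaped like {"row": 3, "col": 7, "value": 42.0}; all of those are placeholders you would swap for your actual setup.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

N, M = 100, 20  # placeholder table dimensions
table = [[None] * M for _ in range(N)]  # in-memory copy of the table

consumer = KafkaConsumer(
    "table-updates",                      # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# This loop blocks forever in a single process; each message updates one cell.
for message in consumer:
    update = message.value
    table[update["row"]][update["col"]] = update["value"]
    # ...periodically persist `table` to disk or a database here...
```

    The consumer loop and the table updates live in one process here; persisting the table (covered in the next sections) is what protects you from losing state on a restart.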

    Relational database (e.g. SQLite)

    Using a relational database like SQLite is probably your best bet, once again depending on the velocity of the data you're receiving. That said, relational databases are used all the time for transactional purposes (in fact, that's one of their primary intended purposes), meaning a high volume and velocity of writes, so it would definitely make sense to persist your data in SQLite and make your updates there as well. You could break your data into separate tables if that made sense (e.g. third normal form), or keep it all in one table if that was a better fit.
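
    As a rough sketch of what that could look like, you could store the table in a "long" layout, one record per cell, and upsert cells as updates arrive. The database file name, table name, and column names below are illustrative, not prescriptive:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect("table_state.db")  # hypothetical database file
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS cells (
        row_id INTEGER NOT NULL,
        col_id INTEGER NOT NULL,
        value  REAL,
        PRIMARY KEY (row_id, col_id)
    )
    """
)

def apply_update(row_id, col_id, value):
    """Upsert one cell (ON CONFLICT ... DO UPDATE needs SQLite >= 3.24)."""
    conn.execute(
        """
        INSERT INTO cells (row_id, col_id, value) VALUES (?, ?, ?)
        ON CONFLICT (row_id, col_id) DO UPDATE SET value = excluded.value
        """,
        (row_id, col_id, value),
    )
    conn.commit()

# When you need the full table, read it back and pivot into an N x M DataFrame.
long_df = pd.read_sql_query("SELECT row_id, col_id, value FROM cells", conn)
wide_df = long_df.pivot(index="row_id", columns="col_id", values="value")
```

    A separate reader process can open the same file and run the same SELECT. SQLite handles a single writer plus concurrent readers reasonably well, particularly with write-ahead logging enabled (PRAGMA journal_mode=WAL), which addresses some of the concurrency worries from the question.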

    Maintain the table in memory

    You could also keep the table in memory, like you suggested, as long as you're persisting it to disk in some fashion (CSV, SQLite, etc.) after updates. For example, you could:

    1. Have your copy in memory.
    2. When you get an update, make the update to your in-memory table.
    3. Write the table to disk.
    4. If your process stops or restarts, read the table back in from disk to start.

    Pandas can be slower for accessing and updating individual values in rows, though, so it might actually make more sense to keep your table in memory as a dictionary or something and write it to disk without using pandas. But if you can get away with doing it all in pandas (re: velocity and volume), that could be a fine way to start too.
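
    For example, here is a minimal sketch of that dictionary-based variant, using pickle snapshots for persistence. The snapshot file name is a placeholder, and snapshotting after every single update is just the simplest policy to show; you could batch writes if the update rate is high.

```python
import os
import pickle

SNAPSHOT_PATH = "table_snapshot.pkl"  # hypothetical snapshot file

def load_table():
    # Step 4: on a restart, reload the last snapshot from disk if one exists.
    if os.path.exists(SNAPSHOT_PATH):
        with open(SNAPSHOT_PATH, "rb") as f:
            return pickle.load(f)
    return {}

def apply_update(table, row, col, value):
    # Step 2: apply the update to the in-memory copy (a dict keyed by (row, col)).
    table[(row, col)] = value
    # Step 3: write the table out to disk.
    with open(SNAPSHOT_PATH, "wb") as f:
        pickle.dump(table, f)

table = load_table()  # Step 1: your copy in memory
```

    When you do need a pandas dataframe for analysis, pd.Series(table).unstack() will pivot the (row, col)-keyed dictionary back into an N x M frame.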