Tags: postgresql, neo4j, etl, airflow, luigi

Right tool for near-real-time ETL architecture


We have a system where our primary data store (and "Universal Source of Truth") is Postgres, but we replicate that data both in real time and in nightly aggregates. We currently replicate to Elasticsearch, Redis, and Redshift (nightly only), and are adding Neo4j as well.

Our ETL pipeline has gotten expansive enough that we're starting to look at tools like Airflow and Luigi, but from what I can tell from my initial research, these tools are meant almost entirely for batch loads in aggregate.

Is there any tool that can handle both large batch ETL processes and on-the-fly, high-volume, individual-record replication? Do Airflow or Luigi handle this, and did I just miss it?

Thanks!


Solution

  • As far as Luigi goes, you would likely end up with a micro-batch approach, running jobs on a short interval. For example, you could trigger a cron job every minute to check for new records in the Postgres tables and process that batch. You can create a task for each item, so that your processing flow is built around a single record. At high volumes, say more than a few hundred updates per second, this becomes a real challenge.
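    A minimal sketch of that micro-batch pattern in plain Python (the watermark logic is the real point; `fetch_new_rows` and `process_row` are hypothetical stand-ins for your Postgres query and downstream writers):

    ```python
    def fetch_new_rows(last_seen_id):
        """Stand-in for a Postgres query such as:
           SELECT * FROM events WHERE id > %s ORDER BY id
        Faked here with an in-memory list so the sketch is self-contained."""
        FAKE_TABLE = [
            {"id": 1, "payload": "a"},
            {"id": 2, "payload": "b"},
            {"id": 3, "payload": "c"},
        ]
        return [row for row in FAKE_TABLE if row["id"] > last_seen_id]

    def process_row(row):
        """Stand-in for replicating one record to Elasticsearch/Redis/Neo4j."""
        return f"replicated id={row['id']}"

    def run_batch(last_seen_id):
        """One cron tick: fetch everything past the watermark, process each
        record individually, and return the advanced watermark."""
        rows = fetch_new_rows(last_seen_id)
        results = [process_row(row) for row in rows]
        new_watermark = max((row["id"] for row in rows), default=last_seen_id)
        return new_watermark, results

    # One minute-by-minute tick, starting from watermark 1:
    watermark, results = run_batch(1)
    ```

    In Luigi proper, `run_batch` would become a wrapper task that yields one task per record, and the watermark would live in a completed-marker or a state table rather than a local variable.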

    Apache Spark has a scalable batch mode and a micro-batch mode of processing, plus some basic pipelining operators that can be adapted to ETL. However, the complexity of the solution, in terms of supporting infrastructure, goes up quite a bit.
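    For comparison, a Spark Structured Streaming job expresses the same micro-batch idea declaratively. This is a sketch, not a working deployment: the Kafka source (e.g. fed by a CDC tool off Postgres), the broker address, the topic name, and the `rows_to_upserts` transform are all hypothetical placeholders:

    ```python
    def rows_to_upserts(rows):
        """Hypothetical transform: turn raw change rows into (id, doc) upserts
        for a downstream store such as Elasticsearch or Neo4j."""
        return [(row["id"], {"payload": row["payload"]}) for row in rows]

    def main():
        # pyspark is imported lazily so the transform above stays importable
        # without a Spark installation.
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("pg-replication").getOrCreate()

        # Placeholder source: Spark cannot stream directly from Postgres, so in
        # practice you would read change events from something like Kafka.
        stream = (
            spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
            .option("subscribe", "pg-changes")                    # placeholder topic
            .load()
        )

        def write_batch(batch_df, batch_id):
            # Each micro-batch arrives as a DataFrame; decode, transform, and
            # write to the replica stores here.
            upserts = rows_to_upserts([r.asDict() for r in batch_df.collect()])
            # ... write `upserts` to Elasticsearch / Redis / Neo4j ...

        (stream.writeStream
               .foreachBatch(write_batch)
               .trigger(processingTime="10 seconds")  # micro-batch interval
               .start()
               .awaitTermination())

    if __name__ == "__main__":
        main()
    ```

    The `trigger(processingTime=...)` interval is what makes this micro-batching rather than per-record streaming; shrinking it trades throughput for latency, which is the same trade-off as shortening the cron interval in the Luigi approach.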