Why is Apache Druid considered a real-time database?


This is a question that relates to how Druid is being marketed.

Why is it called a real-time database when, as I understand it, data cannot be read efficiently until an external tool (like Hive or Spark) has done heavy ETL and loaded semi-aggregated data into Druid, which then writes that input in an efficient, column-store format?

My understanding is that Druid can be considered real-time in terms of communication between Druid and the querying UI, but not between the source of truth (including real-time transactions) and Druid, because of the analytics (possibly multiple joins) required in between.


Solution

  • Druid supports real-time ingestion from Kafka, and the data is available to query immediately; that is why it is considered a real-time data store.

    Druid also supports batch ingestion, as you mentioned, using Hive and Spark.

    Here are more details on Apache Druid:

    Apache Druid is an OLAP data store designed to provide sub-second query performance while ingesting data in real time or in batch.

    Ways to ingest data into Druid:

    • Real-time ingestion - Druid can use Kafka topics to ingest data in real time.

    • Batch ingestion - Druid uses Hive and Spark to read datasets from HDFS. This path is not real-time, but some use cases do not need real-time data and only require fast response times for ad hoc queries.
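As a rough illustration of the real-time path, the sketch below builds a Kafka supervisor spec and submits it to Druid's supervisor endpoint (`/druid/indexer/v1/supervisor`). The datasource name, topic, broker address, column names (`timestamp`, `device_id`, `status`), and router URL are all assumptions made up for this example; adjust them to your own cluster and event schema.

```python
import json
import urllib.request


def build_kafka_supervisor_spec(datasource, topic, bootstrap_servers):
    """Build a minimal Kafka supervisor spec as a plain dict.

    The column names below are hypothetical and must match your events.
    """
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Assumes each event carries an ISO-8601 "timestamp" field.
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": ["device_id", "status"]},
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "HOUR",
                    "queryGranularity": "NONE",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "topic": topic,
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
                "useEarliestOffset": True,
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


def submit_supervisor(spec, router_url="http://localhost:8888"):
    """POST the spec to Druid (router_url is an assumed local address)."""
    request = urllib.request.Request(
        router_url + "/druid/indexer/v1/supervisor",
        data=json.dumps(spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


spec = build_kafka_supervisor_spec("events", "events-topic", "localhost:9092")
# submit_supervisor(spec)  # uncomment when a Druid cluster is reachable
```

With a supervisor like this running, Druid reads from the topic continuously, so no separate ETL job is needed for the streaming case.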

    Where Druid is a great fit:

    • Applications with event-based data

    • Few updates to existing data

    • Sub-second response times
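To make the query side concrete, here is a minimal sketch of an ad hoc aggregation over recent events sent to Druid's SQL endpoint (`/druid/v2/sql`). The datasource name `events`, the `status` column, and the router URL are assumptions for illustration only.

```python
import json
import urllib.request


def build_sql_request(datasource, hours=1):
    """Build a Druid SQL payload counting recent events per status.

    "status" is a hypothetical dimension; __time is Druid's time column.
    """
    query = (
        f'SELECT status, COUNT(*) AS events '
        f'FROM "{datasource}" '
        f"WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '{int(hours)}' HOUR "
        f"GROUP BY status ORDER BY events DESC"
    )
    return {"query": query}


def run_sql(payload, router_url="http://localhost:8888"):
    """POST the query to Druid (router_url is an assumed local address)."""
    request = urllib.request.Request(
        router_url + "/druid/v2/sql",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


payload = build_sql_request("events", hours=1)
# rows = run_sql(payload)  # uncomment when a Druid cluster is reachable
```

A time-filtered aggregation on a single event table like this is the kind of query Druid is built for; note that it involves no joins.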

    When you should not consider Druid:

    • A high number of joins

    • Frequent updates to existing data

    Popular industries/applications for Druid:

    • IoT services

    • Network monitoring

    • Digital marketing

    • Any time-based streaming application