Search code examples
pipelineapache-kafkaamazon-redshiftdata-partitioning

Updating Kafka Event Log


I am using Kafka as a pipeline to store analytics data before it gets flushed to S3 and ultimately to Redshift. I am thinking about the best architecture to store data in Kafka, so that it can easily be flushed to a data warehouse.

The issue is that I get data from three separate page events:

  1. When the page is requested.
  2. When the page is loaded
  3. When the page is unloaded

These events fire at different times (all usually within a few seconds of each other, but up to minutes/hours away from each other).

I want to eventually store a single event about a web page view in my data warehouse. For example, a single log entry as follows:

pageid=abcd-123456-abcde, site='yahoo.com' created='2015-03-09 15:15:15' loaded='2015-03-09 15:15:17' unloaded='2015-03-09 15:23:09'

How should I partition Kafka so that this can happen? I am struggling to find a partition scheme in Kafka that does not need a process using a data store like Redis to temporarily store data while merging the CREATE (initial page view) and UPDATE (subsequent load/unload events).


Solution

  • Assuming:

    • you have multiple interleaved sessions
    • you have some kind of a sessionid to identify and correlate separate events
    • you're free to implement consumer logic
    • absolute ordering of merged events are not important

    wouldn't it then be possible to use separate topics with the same number of partitions for the three kinds of events and have the consumer merge those into a single event during the flush to S3?

    As long as you have more than one total partition you would then have to make sure to use the same partition key for the different event types (e.g. modhash sessionid) and they would end up in the same (per topic corresponding) partitions. They could then be merged using a simple consumer which would read the three topics from one partition at a time. Kafka guarantees ordering within partitions but not between partitions.

    Big warning for the edge case where a broker goes down between page request and page reload though.