apache-kafka schema avro apache-samza bigdata

Event stream data models

I'm working on coming-up with a set of schemas for a new eventing and stream processing system we are building at my company to tie together several currently disconnected systems. We have clearly defined 12 domain models and are now trying to put together a set of event schemas that all applications will event to out confluent (kafka) platform. These will then be picked-up and processed by samza to performa various jobs that will then populate databases for our domain-specific services.

This is all well and good and we started with one event per domain (e.g. address) But, we quickly ran across issues where we require different data for different types of events. For instance,an event to create an address requires all (or most) of the fields in the domain. Whereas an update only requires an id and what is being updated.

So, what I am looking for are some recommendations from those whom have done this in the past? Ideally, I would like to keep it clean with just one event schema per domain. That way we have one corresponding kafka queue per event that can be easily replayed to regain state or return to a specific previous state. However, it feels like the simpler and more pragmatic approach is to use a separate schema for each verb (i.e. create, update, delete)

Stack details of some relevance:

Confluent REST Proxy -> avro -> kafka -> samza -> various dbs.

Solution

The question is quite old but as it has not been answered yet I will give it a try. The thing is that your events should reflect a change in state in your business model, this would typically reflect an activity that has happened. Looking at your example you may have events like:

NewUser, the address may or may not be in your system but you still need to have a natural key for deciding whether it is the case or not.
UserRelocating, same as above
UserLeaving, same
AddressCorrection, may be a scenario where only part of the fields may need to be provided

They are obviously just examples, the events you decide on are dependent on your business model.