Tags: oracle, apache-kafka, activemq-classic, materialized-views, mom

Oracle Materialized View for sensory data transfer


In an application we have to send a stream of sensory data from multiple clients to a central server over the internet. One obvious solution is to use a MOM (message-oriented middleware) such as Kafka, but I recently learned that we could also do this with database synchronization tools such as Oracle materialized views.

The latter approach works in some applications (sending data from a central server to multiple clients, the inverse direction of our application), but what are its pros and cons in our application? Which one is better for sending a sensory data stream from multiple (~100) clients to a server in terms of speed, security, etc.?

Thanks.

P.S.

For more detail, consider an application in which many (about 100) clients have to send streaming data (1 MB of data per minute) to a central server over the internet. The data are needed at the server for online monitoring, analysis, and computations such as machine learning and data mining tasks.

My question is about the difference between a database-to-database connection and streaming solutions such as Kafka for transferring data from the clients to the server.


Solution

  • Prologue

    I'm going to try to break your question down in order to get a clearer understanding of your current requirements and then build it back up again. This has taken a long time to write, so I'd really appreciate it if you do two things off the back of it:

    1. Be sceptical - there's absolutely no substitute for testing things yourself. The internet is very useful as a guide but there's no guarantee that the help you receive (if this answer is even helpful!) is the best thing for your specific situation. It's impossible to completely describe your current situation in the space allotted and so any answer is, of necessity, going to be lacking somewhere.
    2. Look again at how you explained yourself - this is a valid question that's been partly held back by a lack of clarity in your description of the system and what you're trying to achieve. Getting someone unfamiliar with your system to look over a complex question before you post it may help.

    Problem definition

    sensory data stream from multiple clients to a central server

    You're sending data from multiple locations to a single persistence store

    online monitoring

    You're going to be triggering further actions based off the raw data and potentially some aggregated data

    analysis and some computation such as machine learning and data mining tasks

    You're going to be performing some aggregations on the clients' data, i.e. you require aggregations of all of the clients' data to be persisted (however temporarily) somewhere

    Further assumptions

    Because you're talking about materialized views we can assume that all the clients persist data in a database, probably Oracle.

    The data coming in from your clients is about the same topic.

    You've got ~100 clients; at that scale we can assume that:

    • the number of clients might change
    • you want to be able to add clients without increasing the number of methods of accessing data

    You don't work for one of Google, Amazon, Facebook, Quantcast, Apple etc.

    Architecture diagram

    Here, I'm not making any comment on how it's actually going to work - it's the start of a discussion based on my lack of knowledge of your systems. The "raw data persistence" can be files, Kafka, a database, etc. This is a description of the components that are going to be required and a rough guess as to how they will have to connect.

    Basic architecture diagram

    Applying assumed architecture to materialized views

    A materialized view is a persisted query. Therefore you have two choices:

    1. Create a query that unions all 100 clients' data together (a sketch of what this might look like follows this list). If you add or remove a client you must change the query, and if a network issue occurs at any one of your clients then everything fails.
    2. Write and maintain 100 materialized views, which means the Oracle database at your central location has 100 incoming connections.
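
    To make the first option concrete, here's a minimal sketch of what it might look like, assuming each client exposes its table over an Oracle database link and the central server uses the python-oracledb driver. Every name in it (links, table, columns, credentials) is a hypothetical stand-in for your own schema:

```python
# Sketch of option 1: one materialized view that unions every client's data.
# All link/table/column names are hypothetical.
import oracledb

# Adding or removing a client means rebuilding this list and re-creating the view.
CLIENT_LINKS = [f"client_{i:02d}" for i in range(1, 101)]

union_query = "\nUNION ALL\n".join(
    f"SELECT client_id, reading_time, payload FROM sensor_data@{link}"
    for link in CLIENT_LINKS
)

ddl = (
    "CREATE MATERIALIZED VIEW all_client_readings\n"
    "BUILD IMMEDIATE\n"
    "REFRESH COMPLETE ON DEMAND AS\n" + union_query
)

with oracledb.connect(user="central", password="***", dsn="central-db/orcl") as conn:
    with conn.cursor() as cur:
        # A refresh of this view fails outright if any single client link is unreachable.
        cur.execute(ddl)
```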

    As you can probably guess from the tradeoffs you'll have to make I do not like materialized views as the sole solution. We should be trying to reduce the amount of repeated code and single points of failure.

    You can still use materialized views though. If we take our diagram and remove all the duplicated arrows in your central location it implies two things.

    1. There is a single service that accepts incoming data
    2. There is a single service that puts all the incoming data into a single place

    You could then use a single materialized view for your aggregation layer (if your raw data persistence isn't in Oracle you'll first have to put the data into Oracle).
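
    As a rough illustration of that shape, the sketch below assumes Kafka plays the role of the single service that accepts incoming data, a single Oracle table (sensor_readings) is the raw data persistence, and one materialized view sits on top as the aggregation layer. The topic, table, and column names are hypothetical, and the kafka-python and python-oracledb libraries are assumed:

```python
# Sketch: one ingest path into one raw table, one materialized view for aggregation.
# Topic/table/column names and connection details are hypothetical.
import json

import oracledb
from kafka import KafkaConsumer  # kafka-python

ORACLE = dict(user="central", password="***", dsn="central-db/orcl")


def create_aggregation_view(conn):
    """One-off setup: a single materialized view as the aggregation layer."""
    with conn.cursor() as cur:
        cur.execute("""
            CREATE MATERIALIZED VIEW client_hourly_stats
            REFRESH COMPLETE ON DEMAND AS
            SELECT client_id,
                   TRUNC(reading_time, 'HH24') AS hour,
                   COUNT(*)                    AS readings
            FROM   sensor_readings
            GROUP  BY client_id, TRUNC(reading_time, 'HH24')
        """)


def run_ingest():
    """The single service that puts all incoming data into a single place."""
    consumer = KafkaConsumer(
        "sensor-readings",                 # one topic shared by all ~100 clients
        bootstrap_servers="broker:9092",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    with oracledb.connect(**ORACLE) as conn:
        cur = conn.cursor()
        for msg in consumer:               # runs forever; one insert per reading
            cur.execute(
                "INSERT INTO sensor_readings (client_id, reading_time, payload) "
                "VALUES (:1, SYSTIMESTAMP, :2)",
                [msg.value["client_id"], json.dumps(msg.value["payload"])],
            )
            conn.commit()


if __name__ == "__main__":
    run_ingest()
```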

    Consequences of changes

    Now that we've decided you have a single data pipeline, your decisions actually become harder. We've decoupled your clients from the central location and the aggregation layer from the raw data persistence. This means the choices are now yours, but they're also considerably easier to change.

    Reimagining architecture

    Here we need to work out what technologies aren't going to change.

    • Oracle databases are expensive and you're pushing roughly 140 GB/day into yours (100 clients × 1 MB/minute × 1,440 minutes/day, which works out to about 50 TB/year - quite a bit). I don't know if you're actually storing all the raw data, but at those volumes it's less likely that you are - you're probably only storing the aggregations
    • I'm assuming you've got some preferred technologies where your machine learning and data mining happen. If you don't, then consider picking some to prevent the madness of supporting everything

    Putting all of this together we end up with the following. There's actually only one question that matters:

    How many times do you want to read your raw data off your database?

    If the answer to that is once then we've just described middleware of some description. If the answer is more than once then I would reconsider unless you've got some very good disks. Whether you use Kafka for this middle layer is completely up to you. Use whatever you're most familiar with and whatever you're most willing to invest the time into learning and supporting. The amount of data you're dealing with is non-trivial and there's going to be some trial and error getting this right.

    Reimagined architecture diagram
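
    To give a feel for how small the client side of that pipeline can be if you do go with Kafka, here's a hedged sketch. The topic name, broker address, client id, and the read_sensors() helper are all hypothetical placeholders for whatever actually produces your ~1 MB/minute stream:

```python
# Sketch of a client pushing its readings into the single shared topic once a minute.
# Broker address, topic, client id, and read_sensors() are hypothetical.
import json
import time

from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
    compression_type="gzip",       # worth considering at ~100 MB/minute aggregate volume
)

CLIENT_ID = "client_042"           # hypothetical identifier for this client


def read_sensors():
    """Placeholder for the real sensor read (~1 MB of data per minute)."""
    return {"temperature": 21.5, "humidity": 40.2}


while True:
    producer.send(
        "sensor-readings",                      # the single shared topic
        key=CLIENT_ID.encode("utf-8"),          # keeps one client's data in one partition
        value={"client_id": CLIENT_ID, "payload": read_sensors()},
    )
    producer.flush()
    time.sleep(60)                 # one batch per minute, matching the stated rate
```

    The server side stays as described above: one consumer landing everything in the raw data persistence, with whatever aggregation layer you choose on top.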

    One final point about this: we've defined a data pipeline - a single method of data flowing through your system. In doing so, we've increased the flexibility of the system. Want to add more clients? No need to do anything. Want to change the technology behind part of the system? As long as the interface remains the same, there's no issue. Want to send data elsewhere? No problem - it's all in the raw data persistence layer.