Tags: caching, rdbms, bigdata, nosql

Big Data modification strategies


Disclaimer: this question is not work or academia related; it is merely to understand ideas and approaches regarding big data.

Suppose I have a database with 10 billion records about flights all around the world. 10% to 20% of them are updated every minute - the update could be a change of departure/arrival time or any other relevant parameter of the flight.

All the data from the DB is replicated to a cache on another machine (let's call it "the cache machine").

Thousands of clients request data from the cache machine.

My questions are as follows:

1. How can I avoid stale data at the cache machine if the DB gets updated data every minute?

2. What would be the most efficient way for the clients to call the cache machine? Does the fact that the cache machine holds a substantial amount of data, and that multiple clients will access it simultaneously, require an asynchronous approach?

3. Should I use an RDBMS for my DB? If the data is held in such a DB, queries joining different tables could take a long time.

Attempting to answer these questions myself, I'd say:

1.

a. I can clear the cache machine every minute and then reload all the data from the DB. My data will be fresh, but such a query could be painfully slow (a rough sketch of this naive refresh loop follows this list).

OR

b. I can periodically check the state of every item in the cache against the DB; however, that could choke my DB.

2. I can have queue-based requests so the clients won't interfere with each other.

3. An RDBMS wouldn't be a good option for this amount of data. A key/value DB could work for this kind of data.
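
For concreteness, here is a rough sketch of option 1a, the naive full refresh. The fetch_all_flights helper, the SQL, and the in-memory cache dict are placeholders I'm using purely for illustration:

    import time

    REFRESH_INTERVAL_SECONDS = 60   # rebuild the cache once a minute
    cache = {}                      # in-memory stand-in for "the cache machine"

    def fetch_all_flights(db_connection):
        """Placeholder: stream every flight row from the DB."""
        return db_connection.execute(
            "SELECT flight_id, departure, arrival, status FROM flights")

    def full_refresh(db_connection):
        """Option 1a: rebuild the whole cache from scratch. The data is fresh,
        but scanning and transferring billions of rows every minute is the
        painfully slow part."""
        new_cache = {}
        for flight_id, departure, arrival, status in fetch_all_flights(db_connection):
            new_cache[flight_id] = {
                "departure": departure,
                "arrival": arrival,
                "status": status,
            }
        cache.clear()
        cache.update(new_cache)

    # while True:
    #     full_refresh(db_connection)
    #     time.sleep(REFRESH_INTERVAL_SECONDS)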

I'm unsure how I should answer these questions, and would appreciate any good pointers or an explanation of how to deal with such a scenario.


Solution

  • Your problem statements are very short, so I'm trying to clarify with some simplistic assumptions (please correct them if they are wrong and I can tweak the answers accordingly):

    1. Cache Update:
      • Assuming you don't need to keep a full copy of the data in the cache, but only the most recently accessed datasets so that repeated access is faster (thereby improving average access latency), a query can first search the cache and, if the entry is not found, search the DB (see the cache-aside sketch after this list).
      • Assuming you need a push from the DB, you can keep buckets keyed by the timestamp of each data push. A search query starts with the latest time bucket and, if the entry is not found, moves on to the previous bucket. Use a Bloom filter per bucket to cheaply check whether an entry definitely does not exist in that bucket (see the bucket/Bloom filter sketch after this list).
      • You might have to run a background job to consolidate/compact buckets and indexes and to drop the older versions of entries that appear in multiple time buckets (the same sketch ends with a simple compaction pass).
    2. Cache access:
      • Batch mode: go for a queue. Let the queries arrive on a request queue and put the result sets on another queue for the clients to retrieve (see the queue sketch after this list).
      • Online mode: assuming read-only access, you can use memcached/Redis for distributed, high-performance caching (the purpose of caching is, after all, to enable low-latency queries). You can plug an app/web server in front; the cache-aside sketch after this list already uses Redis.
    3. DB choice:
      • Assuming your cache is the access point for queries, you don't need a high-performing DB. Since the data is huge, I'd think distributed caching is needed, and a distributed DB as well. DBs such as Postgres, Hive/HBase, or MongoDB, deployed in a distributed setup, would be good.
      • You can't yet say whether an RDBMS would be good or not, since we don't know the type of data or the access requirements. Assuming access goes through the cache using a key (maybe a composite key), a key-value store such as HBase is a good fit (see the composite-key sketch after this list).
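
    Sketch for point 1 (cache update), the cache-aside read path: the cache is tried first and the DB only on a miss, with a short TTL bounding staleness. This assumes a Redis cache via the redis-py client; the host name, key layout, and query_flight_from_db function are illustrative assumptions, not part of your setup:

        import json
        import redis  # redis-py client; pip install redis

        r = redis.Redis(host="cache-machine", port=6379)  # hypothetical cache host
        CACHE_TTL_SECONDS = 60  # entries expire after a minute, bounding staleness

        def query_flight_from_db(flight_id):
            """Placeholder for your real DB lookup."""
            raise NotImplementedError

        def get_flight(flight_id):
            """Cache-aside read: try the cache, fall back to the DB on a miss,
            then populate the cache so repeated reads are fast."""
            key = f"flight:{flight_id}"
            cached = r.get(key)
            if cached is not None:
                return json.loads(cached)
            flight = query_flight_from_db(flight_id)
            # setex stores the value with an expiry, so unused entries age out on their own
            r.setex(key, CACHE_TTL_SECONDS, json.dumps(flight))
            return flight

        def on_flight_update(flight_id, flight):
            """If the DB pushes changes, overwrite the cached entry immediately
            instead of waiting for the TTL to expire."""
            r.setex(f"flight:{flight_id}", CACHE_TTL_SECONDS, json.dumps(flight))

    The same read path also covers the online mode in point 2, since Redis/memcached sit behind whatever app/web server you put in front.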
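
    Sketch for the time-bucket idea in point 1: each push from the DB lands in its own bucket guarded by a Bloom filter, lookups walk buckets newest-first, and a background compact() pass merges old buckets so only the newest version of each key survives. The bucket structure and filter sizes below are illustrative assumptions:

        import hashlib
        from collections import OrderedDict

        class BloomFilter:
            """Tiny Bloom filter: may report false positives, never false negatives."""
            def __init__(self, size_bits=1 << 20, num_hashes=4):
                self.size = size_bits
                self.num_hashes = num_hashes
                self.bits = bytearray(size_bits // 8)

            def _positions(self, key):
                digest = hashlib.sha256(key.encode()).digest()
                h1 = int.from_bytes(digest[:8], "big")
                h2 = int.from_bytes(digest[8:16], "big")
                return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

            def add(self, key):
                for pos in self._positions(key):
                    self.bits[pos // 8] |= 1 << (pos % 8)

            def might_contain(self, key):
                return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

        class TimeBucket:
            """All entries from one DB push, guarded by a Bloom filter."""
            def __init__(self):
                self.entries = {}
                self.bloom = BloomFilter()

            def put(self, key, value):
                self.entries[key] = value
                self.bloom.add(key)

        buckets = OrderedDict()  # push timestamp -> TimeBucket, inserted oldest-first

        def apply_push(timestamp, updates):
            """Ingest one push from the DB (all rows updated in that window)."""
            bucket = TimeBucket()
            for key, value in updates.items():
                bucket.put(key, value)
            buckets[timestamp] = bucket

        def lookup(key):
            """Walk buckets newest-first; the Bloom filter lets us skip buckets
            that definitely do not contain the key."""
            for timestamp in reversed(buckets):
                bucket = buckets[timestamp]
                if bucket.bloom.might_contain(key) and key in bucket.entries:
                    return bucket.entries[key]
            return None  # not cached -> fall back to the DB

        def compact(max_buckets=10):
            """Background job: fold the oldest bucket into its neighbour so only
            the newest version of each key survives."""
            while len(buckets) > max_buckets:
                _, oldest = buckets.popitem(last=False)
                target = buckets[next(iter(buckets))]   # oldest remaining (newer) bucket
                for key, value in oldest.entries.items():
                    if key not in target.entries:       # newer version wins
                        target.put(key, value)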
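
    Sketch for the batch mode in point 2: a request queue in front of the cache and a result queue per client. Standard-library queues and a single worker thread stand in here for what would realistically be a message broker (Kafka, RabbitMQ, etc.):

        import queue
        import threading

        request_queue = queue.Queue()   # clients enqueue (client_id, flight_id) here
        result_queues = {}              # one result queue per client (illustrative)

        def cache_worker(lookup):
            """Drains the request queue so clients never hit the cache store directly;
            `lookup` is whatever cache lookup function you use (e.g. the one above)."""
            while True:
                client_id, flight_id = request_queue.get()
                result_queues[client_id].put((flight_id, lookup(flight_id)))
                request_queue.task_done()

        def client_request(client_id, flight_id):
            """Client side: enqueue the query, then block on its own result queue."""
            result_queues.setdefault(client_id, queue.Queue())
            request_queue.put((client_id, flight_id))
            return result_queues[client_id].get()

        # threading.Thread(target=cache_worker, args=(get_flight,), daemon=True).start()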
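
    Sketch for point 3, a composite row key on a key-value store. This assumes HBase accessed through the happybase client over Thrift; the host, table name, column family, and key layout are made-up examples:

        import happybase  # pip install happybase; talks to an HBase Thrift server

        connection = happybase.Connection("hbase-thrift-host")  # hypothetical host
        flights = connection.table("flights")                   # hypothetical table

        def row_key(airline, flight_number, departure_date):
            """Composite key: airline + flight number + date, so the natural
            access pattern becomes a single-row get."""
            return f"{airline}:{flight_number}:{departure_date}".encode()

        def put_flight(airline, flight_number, departure_date, departure, arrival, status):
            flights.put(row_key(airline, flight_number, departure_date), {
                b"info:departure": departure.encode(),
                b"info:arrival": arrival.encode(),
                b"info:status": status.encode(),
            })

        def get_flight_row(airline, flight_number, departure_date):
            return flights.row(row_key(airline, flight_number, departure_date))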

    Most probably this is not enough, but if you add more details I can modify the answer accordingly.