Disclaimer: this question is neither work- nor academia-related; it is merely meant to help me understand ideas and approaches regarding big data.
Suppose that I have a database with 10 billion records about flights all around the world. 10% to 20% of them are updated every minute; an update could be a change of departure/arrival or of any other relevant parameter of the flight.
All the data from the DB is copied to a cache on another machine (let's call it "the cache machine").
Thousands of clients request data from the cache machine.
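For a sense of scale, the numbers above imply an average of roughly 17 to 33 million record updates per second. Here is a quick sketch of that arithmetic; the function name is only for illustration, and it assumes the updates are spread evenly over the minute:

```python
# Back-of-envelope update rate for the scenario described above.
TOTAL_RECORDS = 10_000_000_000  # 10 billion flight records
SECONDS_PER_MINUTE = 60


def updates_per_second(total_records: int, fraction_per_minute: float) -> float:
    """Average number of record updates per second (assumes an even spread)."""
    return total_records * fraction_per_minute / SECONDS_PER_MINUTE


low = updates_per_second(TOTAL_RECORDS, 0.10)   # 10% of records per minute
high = updates_per_second(TOTAL_RECORDS, 0.20)  # 20% of records per minute
print(f"{low:,.0f} to {high:,.0f} updates per second")
```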
My questions are as follows:
1. How can I avoid stale data on the cache machine if the DB has updated data every minute?
2. What would be the most efficient way for the clients to call the cache machine? Does the fact that the cache machine holds a substantial amount of data, and that multiple clients will access it simultaneously, require an asynchronous approach?
3. Should I use an RDBMS for my DB? If the data is held in such a DB, queries across different tables could take a long time.
Attempting to answer these questions myself, I'd say that:
1.
a. I can clear the cache machine every minute and then retrieve all the data from the DB. My data will be fresh, but such a query could be painfully slow.
OR
b. I can periodically check the state of every item in the cache; however, that could choke my DB.
2. I can use queue-based requests so the clients won't interfere with each other.
3. An RDBMS wouldn't be a good option for this amount of data. A key/value DB could work for this kind of data.
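To make the trade-off between 1a and 1b concrete, here is a minimal sketch. It assumes each record carries a version counter that the DB bumps on every update; `FakeDB`, `Record`, `full_reload` and `check_each_item` are hypothetical names, and the in-memory dict only stands in for a real database.

```python
from dataclasses import dataclass


@dataclass
class Record:
    value: str
    version: int  # assumption: the DB bumps this on every update


class FakeDB:
    """In-memory stand-in for the real DB; counts the work done against it."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.rows_read = 0
        self.version_checks = 0

    def read_all(self):
        self.rows_read += len(self.rows)
        return {k: Record(r.value, r.version) for k, r in self.rows.items()}

    def read_one(self, key):
        self.rows_read += 1
        r = self.rows[key]
        return Record(r.value, r.version)

    def read_version(self, key):
        self.version_checks += 1
        return self.rows[key].version

    def update(self, key, value):
        self.rows[key] = Record(value, self.rows[key].version + 1)


def full_reload(db):
    """Option 1a: throw the cache away and re-read every row."""
    return db.read_all()


def check_each_item(db, cache):
    """Option 1b: check every cached item, re-read only the stale ones."""
    for key, cached in cache.items():
        if db.read_version(key) != cached.version:
            cache[key] = db.read_one(key)
    return cache


db = FakeDB({f"FL{i}": Record("on time", 1) for i in range(10)})
cache = full_reload(db)  # initial fill of the cache
db.update("FL3", "delayed")
db.update("FL7", "cancelled")

db.rows_read = db.version_checks = 0
cache = check_each_item(db, cache)
checks_1b, reads_1b = db.version_checks, db.rows_read
print("1b:", checks_1b, "version checks,", reads_1b, "rows read")

db.rows_read = db.version_checks = 0
cache = full_reload(db)
checks_1a, reads_1a = db.version_checks, db.rows_read
print("1a:", checks_1a, "version checks,", reads_1a, "rows read")
```

With 2 of 10 rows changed, 1b re-reads only 2 rows but still issues 10 version checks, while 1a re-reads all 10 rows. Either way the DB is touched once per record per cycle, which is the cost I am worried about at 10 billion records.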
I'm unsure how I should answer these questions, and would appreciate any good points or explanations of how to deal with such a scenario.
Your problem statements are very short. I'm trying to clarify them with some simplistic assumptions (please correct my assumptions if they are wrong, and then I can tweak the answers accordingly):
Most probably this is not enough, but if you add more details I can modify my answer accordingly.