Please comment and critique the approach.
Scenario: I have a large dataset (200 million entries) in a flat file. The data is of the form: a 10-digit phone number followed by 5-6 binary fields. Every week I will be getting a delta file which will only contain changes to the data.
Problem: Given a list of items, I need to figure out whether each item (which will be a 10-digit number) is present in the dataset.
The approach I have planned:
Parse the dataset and load it into a DB like MySQL or Postgres (to be done at the start of the week). The reason I want an RDBMS in this first step is that I want to keep the full time-series data.
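A minimal sketch of this first step, assuming Postgres with psycopg2 and a CSV-like flat file of `phone,flags` lines; the table and column names are my own placeholders, not from the post:

    import psycopg2

    DDL = """
    CREATE TABLE IF NOT EXISTS phone_records (
        phone      CHAR(10) NOT NULL,
        flags      TEXT     NOT NULL,          -- the 5-6 binary fields
        valid_from DATE     NOT NULL,          -- week this row became effective
        PRIMARY KEY (phone, valid_from)        -- keeps the full time series
    );
    CREATE TEMP TABLE staging (phone CHAR(10), flags TEXT) ON COMMIT DROP;
    """

    def load_week(conn, flat_file_path, week_start):
        """Bulk-load one weekly file (full dump or delta) as rows valid from week_start."""
        with conn, conn.cursor() as cur:
            cur.execute(DDL)
            with open(flat_file_path) as f:
                cur.copy_expert("COPY staging FROM STDIN WITH (FORMAT csv)", f)
            cur.execute(
                "INSERT INTO phone_records (phone, flags, valid_from) "
                "SELECT phone, flags, %s FROM staging "
                "ON CONFLICT (phone, valid_from) DO UPDATE SET flags = EXCLUDED.flags",
                (week_start,),
            )

Called weekly, e.g. `load_week(psycopg2.connect(dbname="phones"), "delta.csv", "2013-07-01")`, so the deltas accumulate as effective-dated rows rather than overwriting history.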
Then generate some kind of key-value store out of this database holding only the latest valid data, which supports checking whether each item is present in the dataset (thinking of some kind of NoSQL store optimised for lookups, like Redis, here; it should have persistence and be distributable). This data structure will be read-only.
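One way this rebuild could look, assuming redis-py and the Postgres schema sketched above; the key name `phones:current` is an assumption:

    import psycopg2
    import redis

    def rebuild_lookup_set(pg_conn, r, key="phones:current"):
        """Materialise the latest snapshot of phone numbers into a Redis set."""
        tmp_key = key + ":building"
        r.delete(tmp_key)
        with pg_conn.cursor(name="latest") as cur:   # server-side cursor, streams rows
            cur.execute(
                "SELECT DISTINCT ON (phone) phone FROM phone_records "
                "ORDER BY phone, valid_from DESC"
            )
            pipe = r.pipeline(transaction=False)
            for i, (phone,) in enumerate(cur, 1):
                pipe.sadd(tmp_key, phone)
                if i % 10_000 == 0:                  # flush in batches
                    pipe.execute()
            pipe.execute()
        r.rename(tmp_key, key)                       # atomic swap; readers never see a partial set

Building into a temporary key and renaming at the end keeps the store effectively read-only from the API's point of view while a new weekly snapshot is being prepared.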
Query this key-value store to find out whether each item is present (if possible, match a list of values all at once instead of matching one item at a time). I want this to be blazing fast, since this functionality will be the back-end of a REST API.
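A possible shape for that lookup endpoint, assuming Flask (my choice here, not stated in the post) and redis-py; SMISMEMBER (Redis >= 6.2) checks a whole batch of members in a single round trip:

    from flask import Flask, jsonify, request
    import redis

    app = Flask(__name__)
    r = redis.Redis(decode_responses=True)

    @app.post("/lookup")
    def lookup():
        payload = request.get_json(silent=True) or {}
        numbers = payload.get("numbers", [])
        if not numbers:
            return jsonify({})
        # One command, many members: returns 0/1 per queried number
        flags = r.smismember("phones:current", numbers)
        return jsonify(dict(zip(numbers, (bool(f) for f in flags))))

If the Redis version predates SMISMEMBER, a pipeline of SISMEMBER calls gives roughly the same "one round trip per batch" behaviour.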
Side note: my language of preference is Python.
A few considerations for the fast lookup:
Redis SINTER, which performs set intersection.
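For example, if the incoming query list is itself loaded into a temporary Redis set, SINTER returns exactly the members that exist in the dataset in one call; the key names below are illustrative only:

    import redis

    r = redis.Redis(decode_responses=True)

    def present_numbers(query_numbers, dataset_key="phones:current"):
        """Return the subset of query_numbers that exist in the dataset set."""
        tmp = "phones:query:tmp"
        pipe = r.pipeline(transaction=True)
        pipe.delete(tmp)
        pipe.sadd(tmp, *query_numbers)    # load the query list as a temporary set
        pipe.sinter(dataset_key, tmp)     # intersect with the main dataset
        pipe.delete(tmp)                  # clean up
        _, _, hits, _ = pipe.execute()
        return set(hits)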