Search code examples
redisnosqllarge-data

Suggestions/Opinions for implementing a fast and efficient way to search a list of items in a very large dataset


Please comment and critique the approach.

Scenario: I have a large dataset(200 million entries) in a flat file. Data is of the form - a 10 digit phone number followed by 5-6 binary fields. Every week I will be getting a Delta files which will only contain changes to the data.

Problem : Given a list of items i need to figure out whether each item(which will be the 10 digit number) is present in the dataset.

The approach I have planned :

  • Will parse the dataset and put it a DB(To be done at the start of the week) like MySQL or Postgres. The reason i want to have RDBMS in the first step is I want to have full time series data.

  • Then generate some kind of Key Value store out of this database with the latest valid data which supports operation to find out whether each item is present in the dataset or not(Thinking some kind of a NOSQL db, like Redis here optimised for search. Should have persistence and be distributed). This datastructure will be read-only.

  • Query this key value store to find out whether each item is present (if possible match a list of values all at once instead of matching one item at a time). Want this to be blazing fast. Will be using this functionality as the back-end to a REST API

Sidenote: Language of my preference is Python.


Solution

  • A few considerations for the fast lookup:

    • If you want to check a set of numbers at a time, you could use the Redis SINTER which performs set intersection.
    • You might benefit from using a grid structure by distributing number ranges over some hash function such as the first digit of the phone number (there are probably better ones, you have to experiment), this would e.g. reduce the size per node, when using an optimal hash, to near 20 million entries when using 10 nodes.
    • If you expect duplicate requests, which is quite likely, you could cache the last n requested phone numbers in a smaller set and query that one first.