java database redis io hbase

What's the best solution for making millions of small binaries available for testing consistently?


We're developing a biometric matching solution for a verification system. One of the main issues with biometric data is that it consists of unstructured binaries, and every incoming minutiae record must be matched against the whole minutiae database.

Hence, we're looking for a fast and appropriate solution that eliminates the I/O latency of retrieving binaries from the physical hard disk and reduces overhead by keeping all the binary records available for new matching requests.

Currently, our solution is to use an in-memory database like Redis with a caching mechanism. The problem is that memory (RAM) usage grows very large when the number of minutiae binaries is high. We're looking for a solution that keeps all the binaries highly available to our matching application.

Note that each minutiae record is usually smaller than 5 KB, and we have millions of them.


Solution

  • You can use a combination of in-memory and disk-based databases to store millions of minutiae.

    You can store all minutiae in any disk-based database, such as MySQL or PostgreSQL.

    The minutiae data would then be spread across three different datastores:

    • Application cache (Local cache)
    • In-memory DB (Memcached, Redis, etc.)
    • Disk-based DB (MySQL, MongoDB, etc.)

    Let's say you're using Redis and MySQL in your setup.

    Your code should first look for the minutiae in the application cache. If it isn't there, check Redis; if Redis has it, fetch it and store it in the local cache with an expiry.

    If the data isn't in Redis either, query the MySQL database. If you find it there, store the same data in Redis with an expiry.

    With expiry, you avoid holding all objects in memory at the same time.
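
    The lookup order described above can be sketched roughly as follows. This is a minimal illustration and not a production setup: `TieredMinutiaeStore` and `ExpiringCache` are hypothetical names, the same in-process expiring map stands in for both the local cache and Redis, and the MySQL lookup is passed in as a plain function so the sketch has no external dependencies.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical three-tier read path: local cache, then shared cache, then disk-based DB. */
final class TieredMinutiaeStore {

    /** Minimal expiring map; stands in for the local cache and, in this sketch, for Redis. */
    static final class ExpiringCache {
        private static final class Entry {
            final byte[] data;
            final long expiresAtMillis;

            Entry(byte[] data, long expiresAtMillis) {
                this.data = data;
                this.expiresAtMillis = expiresAtMillis;
            }
        }

        private final Map<String, Entry> entries = new ConcurrentHashMap<>();
        private final long ttlMillis;
        private final LongSupplier clock; // injected so that expiry can be tested

        ExpiringCache(long ttlMillis, LongSupplier clock) {
            this.ttlMillis = ttlMillis;
            this.clock = clock;
        }

        Optional<byte[]> get(String id) {
            Entry e = entries.get(id);
            if (e == null) {
                return Optional.empty();
            }
            if (clock.getAsLong() >= e.expiresAtMillis) {
                entries.remove(id, e); // drop the stale entry lazily on read
                return Optional.empty();
            }
            return Optional.of(e.data);
        }

        void put(String id, byte[] data) {
            entries.put(id, new Entry(data, clock.getAsLong() + ttlMillis));
        }
    }

    private final ExpiringCache local;   // application cache
    private final ExpiringCache shared;  // assumption: would be Redis in a real setup
    private final Function<String, Optional<byte[]>> database; // assumption: would query MySQL

    TieredMinutiaeStore(ExpiringCache local, ExpiringCache shared,
                        Function<String, Optional<byte[]>> database) {
        this.local = local;
        this.shared = shared;
        this.database = database;
    }

    /** Looks up one minutiae record, filling the faster tiers on the way back. */
    Optional<byte[]> find(String id) {
        Optional<byte[]> hit = local.get(id);
        if (hit.isPresent()) {
            return hit;
        }
        hit = shared.get(id);
        if (hit.isPresent()) {
            local.put(id, hit.get()); // found in the shared cache: keep a local copy
            return hit;
        }
        hit = database.apply(id);
        if (hit.isPresent()) {
            shared.put(id, hit.get()); // found on disk: store it in the shared cache
        }
        return hit;
    }
}
```

    In a real deployment the shared tier would be backed by a Redis client and the database function by a SQL query. A database hit fills only the shared cache here, as in the steps above; the local cache is filled on the next read.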

    Now suppose you don't want to use expiry because you always need all the minutiae. In that case, you can either increase the size of your Redis instance or use Redis Cluster. Alternatively, an in-memory data grid (IMDG) such as Hazelcast or Apache Ignite can hold all the minutiae. If you'd rather avoid such a complex setup, consider an in-memory database such as SAP HANA or MemSQL.
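
    To judge whether keeping everything in memory is realistic, a rough size estimate helps. The sketch below is only back-of-the-envelope arithmetic: the 5 KB per record comes from the question, while the record count of 10 million and the overhead factor of 1.5 (for keys and per-entry bookkeeping) are assumptions chosen purely for illustration. `MinutiaeMemoryEstimate` is a hypothetical helper name.

```java
/** Hypothetical helper for a back-of-the-envelope RAM estimate. */
final class MinutiaeMemoryEstimate {

    /** Estimated bytes needed to hold every record in memory, including overhead. */
    static long estimateBytes(long recordCount, long bytesPerRecord, double overheadFactor) {
        return Math.round(recordCount * bytesPerRecord * overheadFactor);
    }

    public static void main(String[] args) {
        long records = 10_000_000L;        // assumption: "millions" taken as 10 million
        long bytesPerRecord = 5L * 1024L;  // upper bound of 5 KB per record, from the question
        double overhead = 1.5;             // assumption: illustrative storage overhead
        long bytes = estimateBytes(records, bytesPerRecord, overhead);
        System.out.printf("Estimated RAM: %.1f GiB%n", bytes / (1024.0 * 1024.0 * 1024.0));
    }
}
```

    Under these assumed numbers the raw payload alone is about 51 GB, which is why a single small Redis instance is unlikely to be enough and why a cluster or data grid comes into play.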