python csv dataframe redis distributed

Convert CSV table to Redis data structures


I am looking for a method or data structure for implementing an evaluation system for a binary matcher used in verification.

This system will be distributed over several PCs.

The basic idea is described in many places on the internet, for example in this document: https://precisebiometrics.com/wp-content/uploads/2014/11/White-Paper-Understanding-Biometric-Performance-Evaluation.pdf

The matcher I am testing takes two data items as input and calculates a matching score that reflects their similarity. A threshold is then chosen, depending on the required false match and false non-match rates.

Currently I store the matching scores along with the labels in a CSV file, like this:

label1, label2, genuine, 0.1
label1, label4, genuine, 0.2
... 
label_2, label_n+1, impostor, 0.8
label_2, label_n+3, impostor, 0.9
...
label_m, label_m+k, genuine, 0.3
...

(I have a labeled database.)

Then I run a Python script that loads this table into a Pandas DataFrame and calculates the FMR/FNMR curve, similar to the one shown in figure 2 of the document linked above. The processing is rather simple: sort the DataFrame, scan the rows from top to bottom, and count the impostors and genuines above and below each row.
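A minimal sketch of that scan, under some assumptions that are not stated above: the column names (`label_a`, `label_b`, `kind`, `score`) are made up, the sample labels are placeholders, and a pair is treated as a match when its score is at or below the threshold (which is how the sample rows look, with genuine pairs scoring low). It also assumes both classes are present in the data.

```python
import io

import pandas as pd

# Hypothetical column names; the CSV itself has no header row.
COLUMNS = ["label_a", "label_b", "kind", "score"]


def load_scores(source):
    """Read the headerless score table, tolerating spaces after commas."""
    return pd.read_csv(source, header=None, names=COLUMNS, skipinitialspace=True)


def fmr_fnmr_curve(df):
    """Return FMR and FNMR for every distinct score used as a threshold.

    Assumption: a pair is accepted when score <= threshold.
    """
    ordered = df.sort_values("score").reset_index(drop=True)
    is_genuine = ordered["kind"].eq("genuine")
    n_genuine = int(is_genuine.sum())
    n_impostor = len(ordered) - n_genuine

    # Running counts of pairs accepted at each row's score (row included).
    genuine_accepted = is_genuine.cumsum()
    impostor_accepted = (~is_genuine).cumsum()

    curve = pd.DataFrame({
        "threshold": ordered["score"],
        "fmr": impostor_accepted / n_impostor,              # impostors accepted
        "fnmr": (n_genuine - genuine_accepted) / n_genuine,  # genuines rejected
    })
    # For tied scores only the last row has the complete counts.
    return curve.drop_duplicates("threshold", keep="last").reset_index(drop=True)


# Placeholder data in the same four-column layout as the CSV file.
SAMPLE = io.StringIO(
    "p1, p2, genuine, 0.1\n"
    "p1, p4, genuine, 0.2\n"
    "p2, p7, impostor, 0.8\n"
    "p2, p9, impostor, 0.9\n"
    "p5, p6, genuine, 0.3\n"
)
curve = fmr_fnmr_curve(load_scores(SAMPLE))
print(curve)
```

If the matcher's score grows with similarity instead, the comparison direction (and therefore which cumulative count feeds which rate) would have to be flipped.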

The system should also support finding outliers, to help improve the matching algorithm: the labels of pairs of data items that produced abnormally large genuine scores or abnormally small impostor scores. This is also pretty easy with DataFrames (just sort and take the head rows).
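As an illustration of the outlier lookup, here is a short sketch using `nlargest` and `nsmallest` instead of a full sort. The column names and sample labels are made up for the example.

```python
import pandas as pd


def find_outliers(df, k=3):
    """Return the k largest genuine scores and the k smallest impostor scores."""
    genuine = df[df["kind"] == "genuine"]
    impostor = df[df["kind"] == "impostor"]
    return genuine.nlargest(k, "score"), impostor.nsmallest(k, "score")


# Placeholder rows with hypothetical column names.
scores = pd.DataFrame(
    [
        ("p1", "p2", "genuine", 0.1),
        ("p1", "p4", "genuine", 0.2),
        ("p2", "p7", "impostor", 0.8),
        ("p2", "p9", "impostor", 0.9),
        ("p5", "p6", "genuine", 0.3),
    ],
    columns=["label_a", "label_b", "kind", "score"],
)
bad_genuine, bad_impostor = find_outliers(scores, k=1)
print(bad_genuine[["label_a", "label_b", "score"]])
print(bad_impostor[["label_a", "label_b", "score"]])
```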

Now I'm thinking about how to store the comparison data in RAM instead of in CSV files on an HDD.

I am considering Redis for this: the amount of data is large, several PCs are involved in the computations, and Redis has a master-slave replication feature that lets it quickly sync data over the network, so that several PCs hold exact clones of the data. It is also free.

However, Redis does not seem well suited to storing such tabular data.

Therefore, I would need to change the data structures and the algorithms that process them. However, it is not obvious to me how to translate this table into Redis data structures.

Another option would be to use some other data storage system instead of Redis. However, I am not aware of any such systems and would be grateful for suggestions.


Solution

  • You need to learn more about Redis to solve your challenges. I recommend you give https://try.redis.io a try and then think about your questions.

    TL;DR - Redis isn't a "tabular data" store; it is a store for data structures. It is up to you to pick the data structure(s) that best serve your queries.

    IMO what you want to do is keep the large data (how big is it anyway?) on slower storage and store only the model (FMR curve computations? Outliers?) in Redis. This can almost certainly be done with the existing core data structures (probably Hashes and Sorted Sets in this case), but perhaps even more efficiently with the new Modules API. See the redis-ml module as an example of serving machine learning models off Redis (and perhaps your use case would be a nice addition to it ;))

    Disclaimer: I work at Redis Labs, home of the open source Redis and provider of commercial solutions that build on it, including the above-mentioned module (open source, AGPL licensed).
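
To make the Sorted Sets suggestion concrete, here is one possible mapping, offered only as a sketch. It keeps one sorted set per class, uses the matching score as the sorted-set score and the label pair as the member. The key names, the `label_a|label_b` member format, and the rule that a pair matches when its score is at or below the threshold are all assumptions, not something the question or answer specifies. The functions take any client object exposing `zadd`, `zcard`, `zcount`, `zrange` and `zrevrange` with redis-py style arguments.

```python
# Hypothetical key names, one sorted set per class.
GENUINE_KEY = "scores:genuine"
IMPOSTOR_KEY = "scores:impostor"


def store_comparison(client, label_a, label_b, kind, score):
    """Add one comparison; the member is the label pair, the score is the match score."""
    key = GENUINE_KEY if kind == "genuine" else IMPOSTOR_KEY
    client.zadd(key, {f"{label_a}|{label_b}": float(score)})


def error_rates(client, threshold):
    """Return (FMR, FNMR) at a threshold, assuming score <= threshold means 'match'."""
    n_impostor = client.zcard(IMPOSTOR_KEY)
    n_genuine = client.zcard(GENUINE_KEY)
    # Impostors accepted: scores in [-inf, threshold].
    false_matches = client.zcount(IMPOSTOR_KEY, "-inf", threshold)
    # Genuines rejected: scores in (threshold, +inf]; "(" marks an exclusive bound.
    false_non_matches = client.zcount(GENUINE_KEY, f"({threshold}", "+inf")
    fmr = false_matches / n_impostor if n_impostor else 0.0
    fnmr = false_non_matches / n_genuine if n_genuine else 0.0
    return fmr, fnmr


def outliers(client, k):
    """Return the k largest genuine scores and the k smallest impostor scores."""
    worst_genuine = client.zrevrange(GENUINE_KEY, 0, k - 1, withscores=True)
    worst_impostor = client.zrange(IMPOSTOR_KEY, 0, k - 1, withscores=True)
    return worst_genuine, worst_impostor


# Usage with redis-py would look roughly like this (needs a running server):
#   import redis
#   client = redis.Redis(decode_responses=True)
#   store_comparison(client, "p1", "p2", "genuine", 0.1)
#   print(error_rates(client, 0.5))
```

With this layout each threshold query is a pair of range counts rather than a scan of the whole table, and the outlier lookup reads only the ends of each sorted set. Whether the raw scores belong in Redis at all, or only the derived curve, depends on the data size.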