Search code examples
redisdata-modelingcorrelationkey-value-store

Key-Value Store (Redis) for Phi Coefficient Use-Case


I want to build an application in which the user can assign multiple tags (strings) to a date (YYYY-MM-DD string). The main use-case is to calculate the Phi coefficient for a combination of two tags (A and B), which requires to put every date in one of the following categories:

  1. neither Tag is assigned
  2. Tag A is assigned, but not Tag B
  3. Tag B is assigned but not Tag A
  4. Tag A and Tag B are assigned

The crucial information is how many dates belong to each category, not what dates.

The question is, how to persist that data so that it can be looked up quickly for the categorization described above.

Using a key-value store, such as Redis, storing sets with the tags as keys and the dates for each Tag as values would be an option that makes it easy to fill the store with new information. For the lookup, the intersection (SINTER) of A and B would form the fourth category, the differences (SDIFF) between A and B, and B and A, respectively, would form the second and third category.

The question remains, how to calculate the first category: The number of dates, where neither Tag A nor B applies to. The only option that comes to my mind is to read out the dates by iterating over all the keys, and subtract the numbers of the categories 2, 3, and 4 from the total number of dates. Is there a more elegant and more efficient way to achieve this goal? Or do I better use an SQL data base for that use-case?

EDIT: Another idea would be to not only store the dates by tag, but also the tags by date in a redundant manner, so that the retrieval of all dates is easier.


Solution

  • There are two basic approaches here: store the data in a single canonical form and use that to compute derived data as you need it; or store the information in multiple ways up front to optimize lookup speed.

    Up to now you've taken the first approach. That's good, since storing information in a single place makes many things simpler, and eliminates the risk of having inconsistent data. The downside is that computing derived values can be slow. In your case you're talking about O(n) operations in the best case, with iterating over all keys in the worst case. Although it's always worth doing performance testing before making things more complex, my intuition is that you're right to be worried.

    Storing the derived data separately from the canonical data allows you to optimize your lookup performance. Your last paragraph suggests storing the same information in multiple ways, but as long as you're doing that you might as well store the actual desired derived values rather than keeping the existing dates-by-tag data structure.

    Specifically, my suggestion is to store the tags by date while separately storing the counts for categories 1-4. Each time you record a new (or changed, or deleted) input value you both update your canonical data structure and update your counts. You could probably do so atomically with a fairly simple Lua script. You can then access the desired counts in O(1) time and be confident that they accurately reflect the underlying data.