Search code examples
algorithmhashcompression

Is there a compression algorithm for an array of unique keys that quickly lets me add a key, remove a key, or check for the existence of a key?


I'm thinking about developing a storage-efficient voting system. Users of my app should be able to upvote or downvote a post, undo their vote, vote again, etc. but may never register multiple votes for a single post (just like StackOverflow).

The obvious way for me to store this data would be something like this.. (I'm using names below, but assume these are unique IDs)

{
  postId: 123,
  upvotedBy: ["bob", "sally", "mark"],
  downvotedBy: ["susan", "ryan"],
  score: 1
}

In the event a post goes viral, these arrays could become quite large and might even exceed the storage limit for a single database record. This got me thinking, maybe there is a mathematical trick whereby I can avoid storing these arrays. My only requirements are

  1. quickly check if user X has registered a vote
  2. allow user X to register a vote (if he hasn't already)
  3. allow user X to deregister his vote (if has registered one)

Bad Solution

Suppose every user's id is a prime number. For example

bob: 2
sally: 3
mark: 5
susan: 7
ryan: 11

For each post, I can initialize upvotedBy and downvotedBy to 1

{
  postId: 123,
  upvotedBy: 1,
  downvotedBy: 1,
  score: 0
}

When a user upvotes a post, I simply set upvotedBy := upvotedBy * userId (and the same for when he downvotes a post). Now I can quickly check if a user has already voted for a post, and I can register and deregister votes quickly. My example above would now look like

{
  postId: 123,
  upvotedBy: 30,
  downvotedBy: 77,
  score: 1
}

Of course, this becomes highly storage inefficient once I start multiplying thousands of prime numbers together.

Question

Does a storage and processing efficient compression algorithm that achieves these requirements exist?


Solution

  • If you can check for the existence of a key, then your data structure is equivalent to storing all the keys.

    There is a lower bound on how much space it takes to do that, and there are succinct data structures that get pretty close to that lower bound in all cases. They're pretty complicated though, and are not going to be practical for your use case.

    If you insist, you can use the Roaring Bitmaps library.

    But why do you really want to compress this data in the first place? If it's to reduce network traffic or allow for efficient operation, then the solution is to store just the counts (or nothing, because normalization) with the post and move the votes themselves into a separate table, as @greybeard suggests in comments.