I'm thinking about developing a storage-efficient voting system. Users of my app should be able to upvote or downvote a post, undo their vote, vote again, etc. but may never register multiple votes for a single post (just like StackOverflow).
The obvious way for me to store this data would be something like this.. (I'm using names below, but assume these are unique IDs)
{
postId: 123,
upvotedBy: ["bob", "sally", "mark"],
downvotedBy: ["susan", "ryan"],
score: 1
}
In the event a post goes viral, these arrays could become quite large and might even exceed the storage limit for a single database record. This got me thinking, maybe there is a mathematical trick whereby I can avoid storing these arrays. My only requirements are
Suppose every user's id is a prime number. For example
bob: 2
sally: 3
mark: 5
susan: 7
ryan: 11
For each post, I can initialize upvotedBy
and downvotedBy
to 1
{
postId: 123,
upvotedBy: 1,
downvotedBy: 1,
score: 0
}
When a user upvotes a post, I simply set upvotedBy := upvotedBy * userId
(and the same for when he downvotes a post). Now I can quickly check if a user has already voted for a post, and I can register and deregister votes quickly. My example above would now look like
{
postId: 123,
upvotedBy: 30,
downvotedBy: 77,
score: 1
}
Of course, this becomes highly storage inefficient once I start multiplying thousands of prime numbers together.
Does a storage and processing efficient compression algorithm that achieves these requirements exist?
If you can check for the existence of a key, then your data structure is equivalent to storing all the keys.
There is a lower bound on how much space it takes to do that, and there are succinct data structures that get pretty close to that lower bound in all cases. They're pretty complicated though, and are not going to be practical for your use case.
If you insist, you can use the Roaring Bitmaps library.
But why do you really want to compress this data in the first place? If it's to reduce network traffic or allow for efficient operation, then the solution is to store just the counts (or nothing, because normalization) with the post and move the votes themselves into a separate table, as @greybeard suggests in comments.