Tags: algorithm, random, high-load

Weighted pick with filters


I have a list of elements with weights:

{ id1, weight1 },
{ id2, weight2 },
...
{ idN, weightN }

Weights are small integers (say, less than 1000, often less than 50). The total number of ids in the list is also less than 1000. (Each id is listed only once.)

For each query I have to return a "random enough" element from the list. If I do E queries, where E is proportional to the sum of all weights, then the number of times each element is returned must be exactly proportional to its weight. Note that this should hold already for small values of E (say, up to 50 * the sum of all weights). See also the notes at the end of the question.

So far so good; I'd solve this task by putting element ids into a circular list, duplicating each id weight times, and then shuffling the list. Each query returns the head of the list and then advances the head position.
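
For reference, a minimal sketch of that simple-case approach in Python (the WeightedCycle name is mine, purely illustrative):

    import random

    class WeightedCycle:
        def __init__(self, weighted_ids):
            # weighted_ids: list of (id, weight) pairs; each id is
            # duplicated weight times, then the ring is shuffled once
            self.ring = [id_ for id_, w in weighted_ids for _ in range(w)]
            random.shuffle(self.ring)
            self.head = 0

        def pick(self):
            item = self.ring[self.head]
            self.head = (self.head + 1) % len(self.ring)  # wrap around
            return item

Over any full pass of the ring, each id is returned exactly weight times, which is what makes the proportions exact.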

But in this case I have one additional condition:

I have an additional parameter to the query: a filter. A filter is a map of id => is_enabled. If is_enabled is false for a given id, that id must be excluded from the results. The E value in the restriction above is calculated only over enabled elements; that is, the weights of disabled elements are excluded from the query.

Filters are "unique" for each query and include an entry for every id in the list. (Note that this implies 2^1000 potential filter values.)
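
To make the query shape concrete, a filter might look like this (ids and weights are purely illustrative):

    filter_ = {"id1": True, "id2": False, "id3": True}  # one entry per id
    weights = {"id1": 10, "id2": 25, "id3": 5}

    # the weight sum that E is proportional to counts only enabled ids:
    # here 10 + 5 = 15; id2's weight of 25 is excluded
    enabled_weight_sum = sum(w for id_, w in weights.items() if filter_[id_])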

Is there a way to solve this efficiently? I need the algorithm to be efficient on a multi-server cluster.

Note 1: I want to stress that, as I believe, selecting elements totally at random (as suggested in one of the answers), without storing any state, will not work. It yields an exactly proportional number of elements only over an infinite number of queries; a random number generator has every right to return unfair values for a long period of time.

Note 2: This task imposes no restrictions on the quality of the randomness. Come to think of it, it is not even necessary to shuffle the list in the simple-case solution above. Good randomness is better, but not necessary at all.

Note 3: Please note that the 2^1000 potential filter values mean that I cannot store anything associated with a filter value -- it would require too much memory. I can cache something for the most recent (or most often used) filters, but I can't store things like a per-filter item list offset, as I can't afford to lose that data.

Note 4: We can't return meta-information with the query and let clients store the state for us (a good idea anyway -- thanks, Diacleticus). We can't because two clients may accidentally use the same filter (some filters are more popular than others), in which case we must use the same state for both queries. In fact, a client doing more than one query is a relatively rare event.


Solution

  • Perhaps I've found a solution:

    1. Store id->number_of_queries_left, where the initial value of number_of_queries_left is, say, weight * 10 (so that the list is not refreshed too often; the exactly-proportional requirement should still hold, I think).
    2. On each query:
      1. Pick a random id from the filter for which is_enabled is true.
      2. Decrement number_of_queries_left for that id.
      3. If the result is less than or equal to zero, mark that id as used and pick another one.
      4. If all enabled ids are marked as used and none is found, reinitialize id->number_of_queries_left for all ids that are enabled in the filter ("recharge").

    Looks like it should work. What do you think?
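
    A minimal sketch of this procedure in Python (the function names and the single shared counters map are my assumptions; Update 1 below questions exactly that sharing):

        import random

        REFILL_FACTOR = 10  # step 1: initial counter value is weight * 10

        def init_counters(weights):
            # weights: dict of id -> weight
            return {id_: w * REFILL_FACTOR for id_, w in weights.items()}

        def query(counters, weights, filter_):
            # filter_ has an entry for every id; assumes at least one enabled
            enabled = [id_ for id_ in weights if filter_[id_]]
            used = set()  # ids found exhausted during this query
            while True:
                candidates = [id_ for id_ in enabled if id_ not in used]
                if not candidates:
                    # step 2.4, "recharge": reinitialize enabled ids only
                    for id_ in enabled:
                        counters[id_] = weights[id_] * REFILL_FACTOR
                    used.clear()
                    candidates = list(enabled)
                id_ = random.choice(candidates)  # step 2.1
                counters[id_] -= 1               # step 2.2
                if counters[id_] <= 0:
                    used.add(id_)                # step 2.3: exhausted, retry
                else:
                    return id_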

    Update 1:

    I'm worried that it looks like I have to keep a separate id->number_of_queries_left map for each filter value. I can't afford that due to memory restrictions (there are 2^1000 potential filter values). Am I right?

    Can somebody help me better understand the consequences of a shared number_of_queries_left counter, please?

    Update 2:

    Credits for the idea go to Diacleticus (see comments to this answer).

    What if we don't reset id->number_of_queries_left for all enabled items in the filter, but instead increment each of them by its respective weight? I think this should fix the proportions. (Or should it?)

    The only problem is that with this algorithm each number_of_queries_left counter can go very negative. (See above: we decrement it each time we want to look at its value.)

    So, in a pessimistic case, even after incrementing all the counters, we will not bring any of them above zero. This is probably OK, since we would effectively just run the increment loop until some value becomes positive.
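
    For concreteness, the recharge from this update would look roughly like this (a sketch under my naming; Update 3 below explains why the bare loop is wrong):

        def recharge_by_weights(counters, weights, enabled):
            # assumes at least one enabled id, and that all enabled
            # counters are <= 0 when the recharge is triggered
            while all(counters[id_] <= 0 for id_ in enabled):
                for id_ in enabled:
                    counters[id_] += weights[id_]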

    Update 3:

    No, we can't just run the increment loop until some value becomes positive.

    This would skew the weights: the negative part has no "physical sense" -- it does not represent values actually returned from queries.

    Thus, a hybrid approach:

    When doing a "recharge", increment each counter by weight - min(0, current_counter_value); that is, pull a negative counter back up to exactly weight, and simply add weight to a non-negative one. This should be done atomically, but that looks doable.
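
    In Python, the hybrid recharge might look like this (atomicity across counters is left aside in this sketch):

        def hybrid_recharge(counters, weights, enabled):
            for id_ in enabled:
                # a negative counter is pulled back to exactly weight;
                # a non-negative one simply gains weight
                counters[id_] += weights[id_] - min(0, counters[id_])

    After this recharge, every enabled counter is at least its weight.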

    Still, I'm not sure that weight handling will be fair in this case.

    Comments?