Search code examples
randomstatisticsprobabilitystatistical-sampling

Sampling on a aggregated dataset


Input is a dataset where every row contains for an event, say click. The member ID is a unique ID. sample data: M1,100 M2,100 M3,50 M4,50 The goal is to sample 1% of the clicks, where total clicks are given by summing up all clicks across all member IDs. If I wish to sample 1% on the sample dataset, I wish to apply a technique that samples the click counts randomly and produces 1% or 3 clicks, something like: M1, 1 M2, 1 M4, 1 or some other combination where the sum of clicks across members is 1%.

One basic approach is to explode all entries in the input and have as the data, then sample 1% from it. This would be very slow/inefficient if there are millions of members with 100s of click counts. Looking for a better solution where no explosion of data is needed?


Solution

  • It seems like the obvious thing to do is to sample from users, with probability of each user proportional to the number of clicks for them, and then select a click uniformly at random for the given user. In the example you gave, that means select users with probabilities 100/300, 100/300, 50/300, and 50/300, and then select a click from the given user.

    You can sample proportional to weights (100/300, 100/300, 50/300, 50/300 here) by generating a random number p between 0 and 1 and then finding the smallest k (k = 1, 2, 3, ... #weights) such that the sum of the weights from 1 to k is less than or equal to p.

    An efficient way to find k is construct a list of the partial sums of weights (i.e. 0, w1, w1 + w2, w1 + w2 + w3, ...) and then carry out a binary search (not linear) on that list. A binary search will yield time per sample which grows logarithmically with the number of weights (users in your case), while linear search yields linear growth.

    EDIT: An example. Given n = 10 users with N = (100, 160, 200, 20, 500, 550, 400, 300, 120, 80) events, respectively. Total events = 2430, and weights w = (10/243, 16/243, 20/243, 2/243, 50/243, 55/243, 40/243, 10/81, 4/81, 8/243). Partial sums of weights S = (0, 10/243, 26/243, 46/243, 16/81, 98/243, 17/27, 193/243, 223/243, 235/243, 1). (NOTE: I was mistaken before; the sequence should be (0, w1, w1 + w2, w1 + w2 + w3, ..., w1 + ... + w[n - 1], 1).)

    Given a random number x between 0 and 1, find (by binary search) the index of the partial sum such that S[i] <= x < S[i + 1]. Then select an event uniformly at random from the N[i] events for user i.

    I assume that you can carry out the binary search and the sampling from the per-user events so I won't write out that part.

    EDIT2: Fixed up formula for list of partial sums. The list has n + 1 elements; searching for i such that S[i] <= x < S[i + 1] will therefore yield i = 1, 2, 3, ..., n. The final element, 1, won't ever be selected, assuming the random number is always less than 1.