Search code examples
c#sqldatabasedatabase-performance

Fastest weighted probability table design in SQL


I have a table that looks like the following:

Id (int), Input (string), Output (string)

Given some input X, I need to be able to get a random output Y but weighted according to the frequency that X-Y occurs in the table.

Here are some sample rows:

 1. In1, Out1
 2. In1, Out2
 3. In1, Out2
 4. In2, Out3
 5. In2, Out4

So in this case, for 'In1' my possible randomly generated outputs are either Out1 or Out2. It would be twice as likely to generate 'Out2' since the In1-Out2 transition occurs twice in the database as opposed to In1-Out1 which only appears once. For In2 either Out3 or Out4 will be generated with an equal probability.

Which of the following would be more performant? (or is there a third approach that I'm overlooking)

  1. Duplicating rows where the Input and Output values are identical

Then we just make a single SQL call: (depending on mysql or mssql)

select * from table where Input = X order by rand() limit 1;
select top 1 * from table where Input = X order by NEWID();
  1. Storing the frequency of Input X to Input Y in the table

So now the table has an additional column: Frequency

The previous table would instead look like this:

 1. In1, Out1, 1
 2. In1, Out2, 2
 3. In2, Out3, 1
 4. In2, Out4, 1

My table will be a lot smaller but it seems like whenever I want to get a weighted random row for some input value X, I'll need to first fetch all rows where Input = X into memory and then do the probability test in code.

I'm going to be making this weighted fetch thousands of time per second, so speed is of the utmost importance. The table is likely going to contain greater than a million records.

The program is written C# and it'll be using either SQL Server or MySQL as the backend, not sure if that's going to make much of a difference.


Solution

  • A third (likely the fastest approach) would be to take a variant of the second approach, but using a decimal number range like so:

    1. In1, Out1, 0, 0.33
    2. In1, Out2, 0.33, 1
    3. In2, Out3, 0, 0.5
    4. In2, Out4, 0.5, 1

    To "pick" one weighted, choose any input plus a random number between 0 and 1 (exclusive) and in your SQL query check this random number for >= min and < max. That can be optimized perfectly and takes into account the weight.

    You can make sure that the numbers are correctly distributed on insert by using a trigger.