Search code examples
phpmysqlrandomrowsnonsequential

High performance MySQL random non-sequential row


I'm trying to grab a random row from a table where the data doesn't change. I've read that people try ORDER BY RAND() which is terrible for large datasets and doesn't scale well.

I've also seen the solution that is to get SQL to get a random row between a minimum/maximum range like so: FLOOR(MAX(needed_id) * RAND) but this would only work when the rows are sequential: 1,2,3,4,5,6,7,8,9,10.

The data I need to pull out isn't sequential, for example: 1,2,3,4,10,11,12,13

So I'm thinking there are two solutions:

1st Solution: Keep running this: FLOOR(MAX(needed_id) * RAND) until I receive a row of the right type (1/6 chance)

2nd Solution: Create a duplicate table (as my data never changes) like so:

temp_id | needed_id | type 
1            1          1
2            4          1
3            7          2
3            8          2

So I can pull out a random temp_id using this method: FLOOR(MAX(temp_id) * RAND) - WHERE type = 1

What do you think? I'm likely to run the 1st solution about 6 times until I receive the correct row, but in the 2nd solution it would work straight away but requires another table.


Solution

  • Your statement

    but this would only work when the rows are sequential:

    is not compeletely correct: The floor() and max() examples do work on non-sequential rows, because you would do somehting like

    WHERE id >= FLOOR(RAND()*MAX(id)) LIMIT 1

    So you take the closest ID to the random hit you're getting.

    This does have a slight preference for hits that are directly after a big gap in your sequence, but that might not be too bad, depending on your dataset.

    So depending on how much problem you'd have with this slight preference, how your dataset is, etc etc this could still be the best sollution.

    Because it is unclear to some, the usage of the functions is not a problem:

    MAX is quick on an indexed field. You don't need to count all the rows (slow on innoDB), you just need to traverse your BTREE index, so you'll find this value in log time. This is near-instant

    FLOOR is just a mathematical function which will perform in linear time. Just as RAND. Mind you that ORDER BY rand() is not slow because of rand, but because you need to order the complete table! this is not a problem of rand, but of order.

    Now you have a query that does something like:

    WHERE id >= 48 LIMIT 1
    

    Which is ofcourse very quick on an indexed field. Remember that you got that 48 (an example) not by doing any sort of table scan.