I have a very large table (more than 10M or even 100M records) with this schema:
id int primary key, rule int
and I want to select one random entry per rule. I tried the following query, but it takes a very long time (treenode is the name of the table):
SELECT tmp.id, tmp.rule
FROM treenode
LEFT JOIN (SELECT * FROM treenode ORDER BY RAND()) tmp
  ON (treenode.rule = tmp.rule)
GROUP BY tmp.rule;
Keeping the data in an in-memory hash table takes a huge amount of memory. Another option is to fetch each group from the database and pick a random entry from it, but since there are about 100k groups, sending that many queries to the database also takes a long time.
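For reference, the per-group approach would look roughly like the sketch below (Python with mysql-connector-python; the connection parameters are placeholders, not my actual setup):

import mysql.connector  # assuming mysql-connector-python as the client library

conn = mysql.connector.connect(user="user", password="pass", database="db")  # placeholders
cur = conn.cursor()

# One query per rule: fine for a few groups, far too slow for ~100k of them,
# and ORDER BY RAND() still scans the whole group each time.
cur.execute("SELECT DISTINCT rule FROM treenode")
rules = [row[0] for row in cur.fetchall()]

samples = {}
for r in rules:
    cur.execute("SELECT id FROM treenode WHERE rule = %s ORDER BY RAND() LIMIT 1", (r,))
    samples[r] = cur.fetchone()[0]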
Update: I should add that this table is filled only once and never changes afterwards. Both id and rule have holes in them.
I found out that going through all the entries takes less time than this query. So I added a column computed as rule*max(id)+id and created an index on it (should I use a view instead?).
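Concretely, the one-time setup looks roughly like this (a sketch; the column type, index name, and connection details are my own choices, with BIGINT used because rule * max(id) can overflow a 32-bit INT):

import mysql.connector  # assuming mysql-connector-python

conn = mysql.connector.connect(user="user", password="pass", database="db")  # placeholders
cur = conn.cursor()

# Add the helper column, backfill it, and index it.
cur.execute("ALTER TABLE treenode ADD COLUMN temp BIGINT")
cur.execute("SELECT MAX(id) FROM treenode")
max_id = cur.fetchone()[0]
cur.execute("UPDATE treenode SET temp = rule * %s + id", (max_id,))
cur.execute("CREATE INDEX idx_treenode_temp ON treenode (temp)")
conn.commit()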
I run the following query:
SELECT id, rule, temp FROM treenode WHERE temp > ? ORDER BY temp LIMIT 0, 100000;
At the client I go through all returned entries and fill a buffer. Whenever the rule changes, I pick a random item from the buffer and clear it (set the index back to 0). Then I run the query again with ? bound to the last temp value returned.
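Put together, the client-side loop is roughly the following sketch (again Python with mysql-connector-python; BATCH and the connection details are placeholders, and %s plays the role of ? above):

import random
import mysql.connector  # assuming mysql-connector-python

conn = mysql.connector.connect(user="user", password="pass", database="db")  # placeholders
cur = conn.cursor()

BATCH = 100000
samples = {}          # rule -> randomly chosen id
buffer = []           # ids seen so far for the current rule
current_rule = None
last_temp = -1        # start below the smallest temp (assuming non-negative values)

while True:
    cur.execute(
        "SELECT id, rule, temp FROM treenode "
        "WHERE temp > %s ORDER BY temp LIMIT 0, %s",
        (last_temp, BATCH),
    )
    rows = cur.fetchall()
    if not rows:
        break
    for id_, rule, temp in rows:
        if rule != current_rule:
            # Rule changed: temp orders by rule first, so every row of the
            # previous rule has been seen; pick its random entry now.
            if buffer:
                samples[current_rule] = random.choice(buffer)
            buffer = []
            current_rule = rule
        buffer.append(id_)
        last_temp = temp

# Don't forget the final rule.
if buffer:
    samples[current_rule] = random.choice(buffer)

Since each rule's rows are contiguous in temp order, a reservoir sample of size 1 per rule would avoid the buffer entirely, at the cost of one random draw per row.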