Search code examples
pythonsqliterandom-access

Shuffle huge table in sql3


I am currently using the sqlite3 python package to access data in a table within a DB. The table is quite huge -around 100GB- and I want to randomize its rows, since it contains training data for my NN, which needs to see the whole data several times.

Currently I am iterating through the table using the cursor, which fetches a part of the table, depending on my systems memory.

I know how to access the data randomly, but I do not know how to do it in a way, that I can access all the data in a table only once in an epoch. Thus, I am searching for a way to randomize the whole table, even though this might be computationally very expensive.

Is there an easy way to do it or do you have better suggestions? Thx!


Solution

  • This answer is not SQLite specific, as I have not used it much, but more with SQL in general.

    Two things there, either you care about reproducible results or not. If you don't, you can do something like :

    SELECT *
    FROM my_table
    ORDER BY random();
    

    Otherwise when I want reproducible shuffle with SQL in general, I usually hash the concatenation of a value, lets say '1001' with a few of the values of a row.

    SELECT MD5('1001' || my_column_1 || my_column_2) as r, my_column_1, my_column_2
    FROM my_table
    ORDER BY r;
    

    Of course this is good only if the values in your columns change.

    You can apply a limit like

    SELECT MD5('1001' || my_key_1 || my_key_2) as r, my_key_1, my_key_2
    FROM my_table
    ORDER BY r
    LIMIT 10000;
    

    To get some batches and not the whole table.

    Hope I have answered your question, have a nice day!