I am currently using the sqlite3 python package to access data in a table within a DB. The table is quite huge -around 100GB- and I want to randomize its rows, since it contains training data for my NN, which needs to see the whole data several times.
Currently I am iterating through the table using the cursor, which fetches a part of the table, depending on my systems memory.
I know how to access the data randomly, but I do not know how to do it in a way, that I can access all the data in a table only once in an epoch. Thus, I am searching for a way to randomize the whole table, even though this might be computationally very expensive.
Is there an easy way to do it or do you have better suggestions? Thx!
This answer is not SQLite specific, as I have not used it much, but more with SQL in general.
Two things there, either you care about reproducible results or not. If you don't, you can do something like :
SELECT *
FROM my_table
ORDER BY random();
Otherwise when I want reproducible shuffle with SQL in general, I usually hash the concatenation of a value, lets say '1001' with a few of the values of a row.
SELECT MD5('1001' || my_column_1 || my_column_2) as r, my_column_1, my_column_2
FROM my_table
ORDER BY r;
Of course this is good only if the values in your columns change.
You can apply a limit like
SELECT MD5('1001' || my_key_1 || my_key_2) as r, my_key_1, my_key_2
FROM my_table
ORDER BY r
LIMIT 10000;
To get some batches and not the whole table.
Hope I have answered your question, have a nice day!