I am trying to build a really simple tool in Python.
I have a very large list (about 5 GB) of raffle numbers in a .csv file.
For whatever reason, I cannot get Pandas or even a regular SQL database to load this list into a table and then randomly select a number (I'm trying to pick a random winner).
So it was suggested that I break up the .csv into chunks with code (so far I have no way to even open the list).
The main question is: how will the randomness be affected if I do this? Let's say it breaks the file into 5 chunks, and then I ask it to select a random row of data from ANY of those five chunks. Does the outcome actually give a 100% random row of data, or is it affected by having to run random on two levels, i.e. randomly select one of the five chunks, then randomly select a row from within it?
If I do it that way, isn't that affecting how truly random it is? Or am I just losing my mind thinking about the statistics around it?
(Bonus question: I still haven't figured out a clear way to break the .csv into manageable chunks, so any tips there would be extra awesome!)
The following two scenarios are equivalent:

- Select one row uniformly at random from the entire list.
- Split the list into five equal-size chunks, select one chunk uniformly at random, then select one row uniformly at random from within that chunk.

(Either way, each of the n rows is drawn with probability (1/5) × (5/n) = 1/n.)
But the following are not equivalent:

- Select one row uniformly at random from the entire list.
- Split the list into five chunks of unequal size, select one chunk uniformly at random, then select one row uniformly at random from within that chunk.

(Now a row sitting in a chunk of m rows is drawn with probability (1/5) × (1/m), which is larger when m is smaller.)
Moral of the story: you will be okay as long as the chunks are of equal size. Otherwise, you will over-sample the smaller chunks.
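
To make the moral concrete, here is a minimal sketch with made-up toy chunks (the contents and sizes are invented for illustration). It shows the bias from unequal chunks and the standard fix: weight the chunk draw by chunk size, so the two-level draw becomes exactly uniform again.

```python
import random

# Hypothetical chunks of unequal size (3, 1, and 6 rows).
chunks = [["a", "b", "c"], ["d"], ["e", "f", "g", "h", "i", "j"]]

# Naive two-level draw: each chunk is equally likely, so the lone
# row "d" wins with probability 1/3 instead of the fair 1/10.
naive_winner = random.choice(random.choice(chunks))

# Corrected draw: weight each chunk by its size, then pick a row
# uniformly within it -- every row now wins with probability 1/10.
chunk = random.choices(chunks, weights=[len(c) for c in chunks])[0]
fair_winner = random.choice(chunk)
```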
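That said, for picking a single winner you may not need chunks at all. A one-pass reservoir sample (Algorithm R) gives every row an equal chance while holding only one row in memory, no matter how big the file is. This is a sketch rather than a finished tool, and the filename is a placeholder:

```python
import csv
import random

def random_row(path):
    """Return one uniformly random row from a CSV of any size,
    in a single pass with O(1) memory (reservoir sampling)."""
    winner = None
    with open(path, newline="") as f:
        for i, row in enumerate(csv.reader(f), start=1):
            # Keep row i with probability 1/i; after the full pass,
            # every row has had an equal 1/n chance of surviving.
            if random.randrange(i) == 0:
                winner = row
    return winner

print(random_row("raffle_numbers.csv"))  # placeholder filename
```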
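As for the bonus question: if you still want manageable chunks, pandas can hand you the file piece by piece via the `chunksize` argument to `read_csv`. A plain `read_csv` tries to load all 5 GB at once, which is probably why it choked before. A short sketch, where the chunk size and the per-chunk work are placeholders:

```python
import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames, so
# only about 1,000,000 rows are in memory at any one time.
for chunk in pd.read_csv("raffle_numbers.csv", chunksize=1_000_000):
    print(len(chunk))  # stand-in for whatever per-chunk work you need
```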