python, csv, random, large-files

How random is a random row from random chunks of data?


I am trying to build a really simple tool in Python.

I have a very large list (about 5 GB) of raffle numbers in a .csv file.

For whatever reason, I cannot get Pandas, or even a regular SQL database, to load this list into a table so I can randomly select a number (I'm trying to pick a random winner).

So it was suggested that I break the .csv up into chunks programmatically (so far I have no way to even open the file).

The main question is: how will randomness be affected if I do this? Let's say it breaks the file into 5 chunks, and I then ask it to select a random row of data from ANY of those five chunks. Is the outcome a truly uniform random row, or is it skewed by having to run the random selection at two levels, i.e. randomly select one of the five chunks, then randomly select a row from within it?

If I do it that way, doesn't that affect how truly random it is? Or am I just losing my mind thinking about the statistics around it?

(Bonus question: I still have not figured out a clear way to break the .csv into manageable chunks, so any tips there would be extra awesome!)


Solution

  • The following two scenarios are equivalent:

    1. Pick a card from a deck at random
    2. Pick a suit from {clubs, hearts, spades, diamonds} at random and then pick a card from that suit at random. (This is equivalent because every suit holds the same number of cards: 13.)

    But the following are not equivalent:

    1. Pick a card at random
    2. Pick a category from {face cards, non-face cards} at random and then pick a card from that category at random. This over-samples the face cards, since a deck has only 12 face cards but 40 non-face cards.

    Moral of the story: you will be okay as long as the chunks are of equal size. Otherwise, rows in the smaller chunks will be over-sampled, unless you weight the chunk choice by its size, as in the first sketch below.
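
To make the two-stage draw uniform even when the chunks differ in size, weight the chunk choice by its row count. Here is a minimal Python sketch; the chunk file names (chunk_0.csv and so on) and the one-entry-per-line layout are assumptions, so adjust them to match your actual split:

    import random

    # Hypothetical chunk file names; adjust to match your own split.
    chunk_files = [f"chunk_{i}.csv" for i in range(5)]

    def count_rows(path):
        # One streaming pass per file; nothing is held in memory.
        with open(path) as f:
            return sum(1 for _ in f)

    sizes = [count_rows(p) for p in chunk_files]

    # Stage 1: pick a chunk with probability proportional to its size,
    # so every row in the full file has the same overall chance.
    chosen = random.choices(range(len(chunk_files)), weights=sizes, k=1)[0]

    # Stage 2: pick a uniform row index within the chosen chunk.
    target = random.randrange(sizes[chosen])
    with open(chunk_files[chosen]) as f:
        for i, line in enumerate(f):
            if i == target:
                print("Winner:", line.strip())
                break

When the chunks are all the same size, the weights cancel out and this reduces to the naive "pick a chunk, then pick a row" scheme from the question.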
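
As for the bonus question: you never need to load the whole 5 GB file to split it. Stream it line by line and roll over to a new output file every N rows. A rough sketch, assuming the source file is named raffle.csv and has no header row (both are placeholders):

    ROWS_PER_CHUNK = 1_000_000  # placeholder; pick whatever fits comfortably

    def split_csv(source="raffle.csv"):
        out = None
        with open(source) as src:
            for i, line in enumerate(src):
                if i % ROWS_PER_CHUNK == 0:  # time to start a new chunk file
                    if out:
                        out.close()
                    out = open(f"chunk_{i // ROWS_PER_CHUNK}.csv", "w")
                out.write(line)
        if out:
            out.close()

    split_csv()

Every chunk except possibly the last holds exactly ROWS_PER_CHUNK rows, which is the equal-size condition above; the weighted draw in the previous sketch handles a short final chunk correctly either way.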