generate output files with random samples from pandas dataframe


I have a dataframe with 500K rows. I need to distribute sets of 100 randomly selected rows to volunteers for labeling.

for example:

df = pd.DataFrame(np.random.randint(0,450,size=(450,1)),columns=list('a'))

I can remove a random sample of 100 rows and write it to a file named with a timestamp:

df_subset=df.sample(100)
df_subset.to_csv(time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv')
df=df.drop(df_subset.index)

The above works, but if I try to apply it to the entire example dataframe:

while len(df)>0:
        df_subset=df.sample(100)
        df_subset.to_csv(time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv')
        df=df.drop(df_subset.index)

it runs continuously. My expected output is 5 timestamped dfsample.csv files, 4 of which have 100 rows and the fifth 50 rows, all randomly selected from df. However, df.drop(df_subset.index) doesn't seem to update df, so the condition is always true and the loop runs forever, generating CSV files. I'm having trouble solving this problem.

Any guidance would be appreciated.

UPDATE

This gets me almost there:

for i in range(4):
        df_subset=df.sample(100)
        df=df.drop(df_subset.index)
        time.sleep(1) #added because runs too fast for unique naming
        df_subset.to_csv(time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv')

It requires me to specify the number of files. If I use 5 for the example df, I get an error on the 5th iteration. I hoped for 5 files, with the 5th having 50 rows, but I'm not sure how to do that.


Solution

  • After running your code, I think the problem is not with df.drop but with the file name built from time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv'. That timestamp only has one-second resolution, so every CSV file written within the same second gets the same name and overwrites the previous one.

    If you want to label files using a timestamp, going to sub-second resolution (the %f format code of datetime appends microseconds) makes an overwrite much less likely. In your case:

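    from datetime import datetime

    while len(df) > 0:
        # take at most 100 rows; df.sample raises a ValueError if asked for more rows than remain
        df_subset = df.sample(min(100, len(df)))
        # microsecond-resolution timestamp keeps file names distinct
        df_subset.to_csv(datetime.now().strftime("%Y%m%d_%H%M%S.%f") + 'dfsample.csv')
        df = df.drop(df_subset.index)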
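
As an alternative to sampling and dropping in a loop, here is a minimal sketch that shuffles the dataframe once and writes consecutive slices, so the last file simply holds whatever rows remain. It is not part of the original answer: the helper name write_random_chunks, the numbered file names (used here instead of timestamps so that names cannot collide), and the temporary output directory are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def write_random_chunks(df, out_dir, chunk_size=100, seed=None):
    """Hypothetical helper: shuffle df once, then write slices of chunk_size rows.

    Returns the list of CSV paths written; the last file holds the remainder.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # frac=1 returns every row exactly once, in random order
    shuffled = df.sample(frac=1, random_state=seed)
    paths = []
    for n, start in enumerate(range(0, len(shuffled), chunk_size), start=1):
        chunk = shuffled.iloc[start:start + chunk_size]
        # a running counter in the name replaces the timestamp (assumption)
        path = out_dir / f"dfsample_{n:04d}.csv"
        chunk.to_csv(path)
        paths.append(path)
    return paths


# Same example dataframe as in the question: 450 rows, one column 'a'
df = pd.DataFrame(np.random.randint(0, 450, size=(450, 1)), columns=list("a"))
out_dir = tempfile.mkdtemp()
paths = write_random_chunks(df, out_dir, seed=0)
sizes = [len(pd.read_csv(p, index_col=0)) for p in paths]
print(sizes)  # [100, 100, 100, 100, 50]
```

Because every row appears in exactly one slice, no row is handed to two volunteers, and the number of files does not have to be specified in advance.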