python pandas dataframe random group-by

Randomly sample data based on other columns using Python


I have a dataframe with 100,000 rows and columns such as Country, State, bill_ID, item_id, dates, etc. I want to randomly sample 5k rows out of the 100k such that the sample contains at least one bill_ID from every country and state. In short, it should cover all countries and states with at least one bill_ID.

Note: each bill_ID contains multiple item_id values.

I am testing on sampled data, which should cover all unique countries and states with their bill_IDs.


Solution

  • You could use pandas' .sample method. With df as your dataframe, try:

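    import pandas as pd

    sample_size = 5_000
    # One random row per Country-State combination guarantees coverage
    df_sample_1 = df.groupby(["Country", "State"]).sample(1)
    # Fill up the rest from rows not drawn yet (requires a unique index)
    sample_size_2 = max(sample_size - df_sample_1.shape[0], 0)
    df_sample_2 = df.loc[df.index.difference(df_sample_1.index)].sample(sample_size_2)
    df_sample = pd.concat([df_sample_1, df_sample_2]).sort_index()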
    

    First, group by the columns Country and State and draw a sample of size 1 from each group. This gives you a sample df_sample_1 that covers each Country-State combination exactly once. Then draw the remaining rows from the part of the dataframe that doesn't contain the first sample: df_sample_2. Finally, concatenate both samples (and sort the result if needed).
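
As a rough way to try the steps above end to end, here is a minimal sketch on made-up data. The helper name sample_covering_groups, the toy dataframe, the seed argument, and the smaller sample size of 50 are assumptions for illustration and are not part of the original answer. The sketch also assumes a unique index and caps the second draw at the number of rows still available.

```python
import numpy as np
import pandas as pd


def sample_covering_groups(df, group_cols, sample_size, seed=None):
    """Draw about `sample_size` rows with at least one row per group.

    Hypothetical helper; assumes `df` has a unique index.
    """
    # Step 1: one random row per group guarantees every group is covered.
    covered = df.groupby(group_cols).sample(n=1, random_state=seed)
    # Step 2: draw the rest from rows that were not picked in step 1,
    # never asking for more rows than are left.
    remaining = df.drop(index=covered.index)
    n_rest = min(max(sample_size - len(covered), 0), len(remaining))
    rest = remaining.sample(n=n_rest, random_state=seed)
    # Step 3: combine both parts and restore the original row order.
    return pd.concat([covered, rest]).sort_index()


# Toy data standing in for the real 100k-row dataframe (made up).
rng = np.random.default_rng(0)
n_rows = 1_000
df = pd.DataFrame({
    "Country": rng.choice(["US", "DE", "IN"], size=n_rows),
    "State": rng.choice(["A", "B", "C", "D"], size=n_rows),
    "bill_ID": rng.integers(1, 200, size=n_rows),
    "item_id": np.arange(n_rows),
})

df_sample = sample_covering_groups(
    df, ["Country", "State"], sample_size=50, seed=42
)

# Compare the Country-State pairs in the full data and in the sample.
all_pairs = set(zip(df["Country"], df["State"]))
sampled_pairs = set(zip(df_sample["Country"], df_sample["State"]))
print(len(df_sample), sampled_pairs == all_pairs)
```

Note that this samples individual rows, so a bill_ID in the sample may appear with only some of its item_id rows. If complete bills are needed, one option would be to sample bill_ID values in the same two-step way and then keep every row belonging to the chosen bills.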