I have a dataframe with 100,000 rows that contains the columns Country, State, bill_ID, item_id, dates, etc. I want to randomly sample 5k rows out of the 100k, and the sample should contain at least one bill_ID from every country and state. In short, it should cover all countries and states with at least one bill_ID each.
Note: each bill_ID contains multiple item_id values.
I am running tests on the sampled data, so it should cover all unique countries and states with their bill_IDs.
You could use Pandas' .sample method. With df as your dataframe, try:
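import pandas as pd

sample_size = 5_000
# One random row per (Country, State) pair guarantees full coverage.
df_sample_1 = df.groupby(["Country", "State"]).sample(1)
# Draw the remaining rows from those not already picked.
sample_size_2 = max(sample_size - df_sample_1.shape[0], 0)
df_sample_2 = df.loc[df.index.difference(df_sample_1.index)].sample(sample_size_2)
df_sample = pd.concat([df_sample_1, df_sample_2]).sort_index()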
First group by the columns Country and State and draw samples of size 1. This gives you a sample df_sample_1 that covers each Country-State combination exactly once. Then draw the rest, df_sample_2, from the part of the dataframe that doesn't contain the first sample. Finally, concatenate both samples (and sort the result if needed).
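To check that the approach behaves as described, here is a minimal self-contained sketch on made-up data. The helper name sample_covering_groups, the seed parameter, and the toy dataframe are assumptions for illustration only, not part of the question. The sketch also assumes the dataframe has a unique index, because the already drawn rows are excluded by index label.

```python
import numpy as np
import pandas as pd


def sample_covering_groups(df, group_cols, sample_size, seed=None):
    """Draw sample_size rows with at least one row per combination of group_cols.

    Hypothetical helper; assumes df has a unique index.
    """
    # One random row per group guarantees that every combination is covered.
    covering = df.groupby(group_cols).sample(n=1, random_state=seed)
    # Fill the remaining quota from rows that were not already drawn.
    remaining = max(sample_size - len(covering), 0)
    rest = df.drop(covering.index).sample(n=remaining, random_state=seed)
    return pd.concat([covering, rest]).sort_index()


# Toy data, invented for this example: 1,000 rows, 4 countries x 3 states.
rng = np.random.default_rng(0)
n = 1_000
toy = pd.DataFrame({
    "Country": rng.choice(["A", "B", "C", "D"], size=n),
    "State": rng.choice(["x", "y", "z"], size=n),
    "bill_ID": rng.integers(1, 200, size=n),
    "item_id": np.arange(n),
})

sample = sample_covering_groups(toy, ["Country", "State"], sample_size=50, seed=0)

# Compare the (Country, State) pairs in the sample with those in the full data.
all_pairs = set(zip(toy["Country"], toy["State"]))
sampled_pairs = set(zip(sample["Country"], sample["State"]))
print(len(sample), sampled_pairs == all_pairs)
```

Note that if the number of Country-State combinations is larger than sample_size, the result contains one row per combination and is therefore larger than sample_size, since coverage takes priority over the requested size.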