Search code examples
pythonpandassampling

Is there a way to sample on a 'type' column, whilst keeping all ID's within that type in another column?


I'm splitting a dataframe into two; one to get the average over a period of time, and the other to use that average on. The dataframe looks similar to the following:

ID    Type    Num. Hours     Month

2     black       10           Jan

2     black       12           Feb

2     black       15           March

3     red         7            Jan

3     red         10           Feb

The ID's each have 24 rows, spanning over 2 years. Different ID's can have the same Type or different Type. I'd like the two split dataframes to have the same amount of different Type's in, whilst keeping all 24 of the ID's together for each unique ID.

I've tried grouping by Type and ID, together and separately, but it seems to give me only a fraction of the ID's instead of keeping them together

df1 = df.groupby('ID')['Type'].apply(lambda x: x.sample(frac=0.5))

or

df1 = df.groupby(['ID', 'Type']).apply(lambda x: x.sample(frac=0.5))

and afterwards of course I would use than index to get the second split dataframe from the original.

Neither have worked the way I require.

For the output, it should be two dataframes, which should not share any ID's and have an equal amount of different Types.

So using something similar to the above, I would hopefully output a DataFrame which looks like this:

ID    Type    Num. Hours     Month

2     black       10           Jan

2     black       12           Feb

2     black       15           March

5     yellow       17           Jan

5     yellow       21           Feb

Using that table would allow me to index on the original dataframe and give me a second table which outputs something similar to the following:

ID    Type    Num. Hours     Month

4     black       10           Jan

4     black       12           Feb

4     black       15           March

6     yellow      22            Jan

6     yellow      27            Feb

Solution

  • sample takes a fraction but does not split the dataframe in two. Having obtained half of the samples, taking the other half is simple!

    I am assuming your original line works as you want it to work for the first dataframe

    df1 = df.groupby(['ID', 'Type']).apply(lambda x: x.sample(frac=0.5))
    df2 = df[~df.index.isin(df1.index)]
    

    Update

    Based on comments; to randomly divide your ID's over two dataframes you can use the following:

    import random
    
    unique_ids = df.ID.unique()
    random.shuffle(unique_ids)
    id_set_1 = unique_ids[: len(unique_ids) // 2]  # take first half of list
    
    df1 = df[df.ID.isin(id_set_1)]
    df2 = df[~df.ID.isin(id_set_2)]
    

    Beware that this could lead to two dataframes with very different sizes, depending on the number of types per ID!