I'm splitting a dataframe into two; one to get the average over a period of time, and the other to use that average on. The dataframe looks similar to the following:
ID Type Num. Hours Month
2 black 10 Jan
2 black 12 Feb
2 black 15 March
3 red 7 Jan
3 red 10 Feb
The ID's each have 24 rows, spanning over 2 years. Different ID's can have the same Type or different Type. I'd like the two split dataframes to have the same amount of different Type's in, whilst keeping all 24 of the ID's together for each unique ID.
I've tried grouping by Type and ID, together and separately, but it seems to give me only a fraction of the ID's instead of keeping them together
df1 = df.groupby('ID')['Type'].apply(lambda x: x.sample(frac=0.5))
or
df1 = df.groupby(['ID', 'Type']).apply(lambda x: x.sample(frac=0.5))
and afterwards of course I would use than index to get the second split dataframe from the original.
Neither have worked the way I require.
For the output, it should be two dataframes, which should not share any ID's and have an equal amount of different Types.
So using something similar to the above, I would hopefully output a DataFrame which looks like this:
ID Type Num. Hours Month
2 black 10 Jan
2 black 12 Feb
2 black 15 March
5 yellow 17 Jan
5 yellow 21 Feb
Using that table would allow me to index on the original dataframe and give me a second table which outputs something similar to the following:
ID Type Num. Hours Month
4 black 10 Jan
4 black 12 Feb
4 black 15 March
6 yellow 22 Jan
6 yellow 27 Feb
sample takes a fraction but does not split the dataframe in two. Having obtained half of the samples, taking the other half is simple!
I am assuming your original line works as you want it to work for the first dataframe
df1 = df.groupby(['ID', 'Type']).apply(lambda x: x.sample(frac=0.5))
df2 = df[~df.index.isin(df1.index)]
Update
Based on comments; to randomly divide your ID
's over two dataframes you can use the following:
import random
unique_ids = df.ID.unique()
random.shuffle(unique_ids)
id_set_1 = unique_ids[: len(unique_ids) // 2] # take first half of list
df1 = df[df.ID.isin(id_set_1)]
df2 = df[~df.ID.isin(id_set_2)]
Beware that this could lead to two dataframes with very different sizes, depending on the number of types per ID!