I have a first dataframe:
df1 = pd.DataFrame({'subject':[1,4,8,1,2,6,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,2,3,4,5,8,9,10],
'trial':[1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3]})
print(df1)
subject trial
0 1 1
1 4 1
2 8 1
3 1 2
4 2 2
5 6 2
6 9 2
7 11 2
8 12 2
9 13 2
10 14 2
11 15 2
12 16 2
13 17 2
14 18 2
15 19 2
16 20 2
17 21 2
18 22 2
19 23 2
20 24 2
21 2 3
22 3 3
23 4 3
24 5 3
25 8 3
26 9 3
27 10 3
And a second dataframe:
df2 = pd.DataFrame({'subject':[2,3],
'trial':[12,4]})
print(df2)
subject trial
0 2 12
1 3 4
I would like to remove for subject 2, all the trials until trial 12 included (trial 1,4,8,1,2,6,9,11,12 for subject 2) and for subject 3 until trial 4 included (trial 2,3,4 for subject 3)
Would it be possible to do it comparing both DataFrames or with map and drop from a dictionary?
groupby('subject')
and agg(set)
onto trial
.join
the two dataframesset
of trail_n
using range(x + 1)
since the objective is to remove all previous trials.difference
of the two sets
list
and explode
, if you want.
explode
is available from pandas 0.25import pandas as pd
# data
df1 = pd.DataFrame({'subject': [1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3],
'trial': [1,4,8,1,2,6,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,2,3,4,5,8,9,10]})
df2 = pd.DataFrame({'subject': [2, 3],
'trial': [12, 4]})
# set subject as the index in df2
df2.set_index('subject', inplace=True)
# for df1 groupby subject and agg set onto trial
df1g = pd.DataFrame(df1.groupby('subject')['trial'].agg(set))
| subject | trial |
|----------:|:---------------------------------------------------------------------|
| 1 | {8, 1, 4} |
| 2 | {1, 2, 6, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24} |
| 3 | {2, 3, 4, 5, 8, 9, 10} |
# join the dataframes
df_join = df1g.join(df2, rsuffix='_n')
| subject | trial | trial_n |
|----------:|:---------------------------------------------------------------------|----------:|
| 1 | {8, 1, 4} | nan |
| 2 | {1, 2, 6, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24} | 12 |
| 3 | {2, 3, 4, 5, 8, 9, 10} | 4 |
# this is required because set & range don't work with nan
df_join.trial_n.fillna(0, inplace=True)
# convert trial_n to a set; int is required because range doesn't work with float
df_join.trial_n = df_join.trial_n.apply(lambda x: set(range(int(x)+1)))
# take the set difference
df_join['remains'] = df_join.trial - df_join.trial_n
| subject | trial | trial_n | remains |
|----------:|:---------------------------------------------------------------------|:-------------------------------------------|:-------------------------------------------------|
| 1 | {8, 1, 4} | {0} | {8, 1, 4} |
| 2 | {1, 2, 6, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24} | {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12} | {13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24} |
| 3 | {2, 3, 4, 5, 8, 9, 10} | {0, 1, 2, 3, 4} | {8, 9, 10, 5} |
# create df_final & convert remains back to a list so explode can be used
df_final = pd.DataFrame(df_join.remains.map(list).map(sorted))
| subject | remains |
|----------:|:-------------------------------------------------|
| 1 | [1, 4, 8] |
| 2 | [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24] |
| 3 | [5, 8, 9, 10] |
# explode the lists
df_final = df_final.explode('remains')
| subject | remains |
|----------:|----------:|
| 1 | 1 |
| 1 | 4 |
| 1 | 8 |
| 2 | 13 |
| 2 | 14 |
| 2 | 15 |
| 2 | 16 |
| 2 | 17 |
| 2 | 18 |
| 2 | 19 |
| 2 | 20 |
| 2 | 21 |
| 2 | 22 |
| 2 | 23 |
| 2 | 24 |
| 3 | 5 |
| 3 | 8 |
| 3 | 9 |
| 3 | 10 |