I'd like to find the difference between values in a Pandas groupby dataframe, but for specific column values. I've read multiple posts about using the diff
command, but that applies to subsequent rows regardless of groupings.
In the dataframe below (it's a dictionary), the dataframe has columns for user id trial_id
, a condition placebovstreatment
, a moderator variable expbin
, and a value.
I want to calculate the difference between values within users, but only if they have values for certain condition categories.
For instance, user 1 has values of
correct_placebo_baseline 10.000
correct_treatment 21.000
The difference is 11.
User 2 has values of
0 22.000
correct_placebo_baseline 8.688
The difference is roughly 14.
User 1 has a difference between column categories correct_placebo_baseline
and correct_treatment
. User 2 has a difference between, correct_placebo_baseline
and category '0'.
How do I calculate only if a user has both a correct_placebo_baseline
and a 'correct_treatment' groupings? Or, alternatively, how do you create columns where the differences are specific per group per user?
The formula could create columns difference from baseline for correct placebo
and 'difference from baseline for 0' for each trial_id.
The challenge is that some users don't have a baseline score. Some users have a baseline score but nothing else. I need difference values only if they have both.
I tried to find a way to run a function when groupby categories meet certain criteria, but couldn't.
Thanks for any help and let me know if I can make this question easier to answer.
{'trial_id': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4, 9: 5},
'placebovstreatment': {0: '0',
1: 'correct_placebo_baseline',
2: 'correct_treatment',
3: '0',
4: 'correct_placebo_baseline',
5: 'correct_placebo_baseline',
6: 'incorrect_placebo',
7: 'correct_placebo_baseline',
8: 'incorrect_placebo',
9: '0'},
'expbin': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 2, 7: 1, 8: 1, 9: 1},
'value': {0: 31.5,
1: 10.0,
2: 21.0,
3: 22.0,
4: 8.688,
5: 20.0,
6: 37.5,
7: 12.0,
8: 32.5,
9: 10.0}}
You can pivot
to get the conditions as columns:
df2 = df.pivot(index=['trial_id', 'expbin'], columns='placebovstreatment', values='value')
Output:
placebovstreatment 0 correct_placebo_baseline correct_treatment incorrect_placebo
trial_id expbin
1 1 31.5 10.000 21.0 NaN
2 2 22.0 8.688 NaN NaN
3 2 NaN 20.000 NaN 37.5
4 1 NaN 12.000 NaN 32.5
5 1 10.0 NaN NaN NaN
You can then easily perform computations:
df2['correct_treatment'] - df2['correct_placebo_baseline']
Output:
trial_id expbin
1 1 11.0
2 2 NaN
3 2 NaN
4 1 NaN
5 1 NaN
dtype: float64