I have panel data (repeated observations per ID at different points in time). Data is unbalanced (there are gaps). I need to check and possibly adjust for a change in variable per person over the years.
I tried two versions. First, a for
loop-setting, to first access each person and each of its years. Second, a one line combination with groupby
. Groupby looks more elegant to me. Here the main issue is to identify the "next element". I assume in a loop I can solve this with a counter.
Here is my MWE panel data:
import pandas as pd
df = pd.DataFrame({'year': ['2003', '2004', '2005', '2006', '2007', '2008', '2009','2003', '2004', '2005', '2006', '2007', '2008', '2009'],
'id': ['1', '1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2', '2'],
'money': ['15', '15', '15', '16', '16', '16', '16', '17', '17', '17', '18', '17', '17', '17']}).astype(int)
df
Here is what a time series per person looks like:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
fig, ax = plt.subplots()
for i in df.id.unique():
df[df['id']==i].plot.line(x='year', y='var', ax=ax, label='id = %s'%i)
df[df['id']==i].plot.scatter(x='year', y='var', ax=ax)
plt.xticks(np.unique(df.year),rotation=45)
Here is what I want to achieve: For each person, compare the time series of values and drop every successor who is different from its precursor value (identify red circles). Then I will try different strategies to handle it:
Solution to drop
df['money_difference'] = df['money']-df.groupby('id')['money'].shift(1)
df_new = df.drop(df[df['money_difference'].abs()>0].index)
Idea to smooth
# keep track of change of variable by person and time
df['money_difference'] = df['money']-df.groupby('id')['money'].shift(1)
# first element has no precursor, it will be NaN, replace this by 0
df = df.fillna(0)
# now: whenever change_of_variable exceeds a threshold, replace the value by its precursor - not working so far
df['money'] = np.where(abs(df['money_difference'])>=1, df['money'].shift(1), df['money'])
To get the next event in your database you can use a combination with groupby
and shift
and then do the subraction to the previos event:
df['money_difference'] =df.groupby(['year', 'id'])['money'].shift(-1)-df['money']