I have a dataframe which kind of looks like this:
date holiday item_cnt_day shop_id cnt_sem cnt_mes cnt_year
0 2013-01-01 1 0.0 59 0.000000 0.000000 0.000000
1 2013-01-02 1 0.0 59 0.000000 0.000000 0.000000
2 2013-01-03 1 0.0 59 0.000000 0.000000 0.000000
3 2013-01-04 1 0.0 59 0.000000 0.000000 0.000000
4 2013-01-05 0 0.0 59 0.000000 0.000000 0.000000
...... ... ... ... ... ... ...
1029 2015-10-27 0 4.0 36 1.142857 0.321429 0.024658
1030 2015-10-28 0 1.0 36 1.285714 0.357143 0.027397
1031 2015-10-29 0 1.0 36 1.142857 0.392857 0.030137
1032 2015-10-30 0 4.0 36 1.714286 0.535714 0.041096
1033 2015-10-31 0 1.0 36 1.857143 0.571429 0.043836
dates start from 2013-01-01 to 2015-10-31 and this range of dates is for every shop_id, this is, for every shop_id I have that range for the dates, therefore, I have duplicate dates, but what I'm trying to do is to only have those dates which comes after the first 365 days per shop_id and I'm trying to do that by using this function:
def no_todos(df, shops):
# shops is a list of shops and there are 60 shops in this list
# df is the dataframe to be operated in the loop
new_df = pd.DataFrame(df)
# Here I'm trying to only keep those observations which come after the first 365 days for each shop
for t in shops:
new_df['shop_id'][t] = df[365::]
return new_df
However, I'm having this error: Reindexing only valid with uniquely valued Index objects. Does anyone knows how to solve this? Thanks in advance.
First sort the dataframe, then do a groupby and then do negative tail. Negative tail is not implemented in groupby tail
method so you need to create your own function. This will skip the first rows of each group
df.sort_values(['shop_id', 'date'], ascending=[True, True])
def negative_tail(group, n):
return group._selected_obj[group.cumcount(ascending=True) >= n]
final_result = negative_tail(df.groupby('shop_id'), 365).copy()