I have monthly performance of students for several years for all subjects. DataFrame has following columns: [Name, Subject, Month, Year, Marks] as given in following image 1:
Name Month Year Subject Marks
0 A 1 2022 Math 80
1 A 2 2022 Math 80
2 A 3 2022 Math 80
3 A 4 2022 Math 70
4 A 5 2022 Math 80
5 A 6 2022 Math 80
6 A 7 2022 Math 80
Now I want combine consecutive rows having same performance for given student and subject. As given in following image:
Name Subject Marks Time_Period
0 A Math 80 1.2022-3.2022
1 A Math 70 4.2022-4.2022
2 A Math 80 5.2022-7.2022
I have tried to group dataframe and extract Min/Max(Month) and Min/Max(Year). But it will give wrong result if student has different performance in month in between.
You can use a custom groupby.agg
:
# identify consecutive marks
group = df['Marks'].ne(df['Marks'].shift()).cumsum()
out = (df.assign(Time_Period=lambda d: d['Month'].astype(str)
+'.'+d['Year'].astype(str))
.groupby(['Name', 'Subject', 'Marks', group],
sort=False, as_index=False)
['Time_Period']
.agg(lambda x: '-'.join([x.iloc[0], x.iloc[-1]]))
)
If you want to start a new group when a month is missing:
# identify consecutive marks
group1 = df['Marks'].ne(df['Marks'].shift()).cumsum()
# group by successive months
group2 = df.groupby('Year')['Month'].diff().ne(1)
out = (df.assign(Time_Period=lambda d: d['Month'].astype(str)
+'.'+d['Year'].astype(str))
.groupby(['Name', 'Subject', 'Marks', group1, group2],
sort=False, as_index=False)
['Time_Period']
.agg(lambda x: '-'.join([x.iloc[0], x.iloc[-1]]))
)
Output:
Name Subject Marks Time_Period
0 A Math 80 1.2022-3.2022
1 A Math 70 4.2022-4.2022
2 A Math 80 5.2022-7.2022