Search code examples
pythonpandasdataframedata-processing

Pandas: Combine consecutive months having same values in other column


I have monthly performance of students for several years for all subjects. DataFrame has following columns: [Name, Subject, Month, Year, Marks] as given in following image 1:

  Name  Month  Year Subject  Marks
0    A      1  2022    Math     80
1    A      2  2022    Math     80
2    A      3  2022    Math     80
3    A      4  2022    Math     70
4    A      5  2022    Math     80
5    A      6  2022    Math     80
6    A      7  2022    Math     80

Now I want combine consecutive rows having same performance for given student and subject. As given in following image:

  Name Subject  Marks    Time_Period
0    A    Math     80  1.2022-3.2022
1    A    Math     70  4.2022-4.2022
2    A    Math     80  5.2022-7.2022

I have tried to group dataframe and extract Min/Max(Month) and Min/Max(Year). But it will give wrong result if student has different performance in month in between.


Solution

  • You can use a custom groupby.agg:

    # identify consecutive marks
    group = df['Marks'].ne(df['Marks'].shift()).cumsum()
    
    out = (df.assign(Time_Period=lambda d: d['Month'].astype(str)
                                      +'.'+d['Year'].astype(str))
             .groupby(['Name', 'Subject', 'Marks', group],
                      sort=False, as_index=False)
             ['Time_Period']
             .agg(lambda x: '-'.join([x.iloc[0], x.iloc[-1]]))
          )
    

    If you want to start a new group when a month is missing:

    # identify consecutive marks
    group1 = df['Marks'].ne(df['Marks'].shift()).cumsum()
    
    # group by successive months
    group2 = df.groupby('Year')['Month'].diff().ne(1)
    
    out = (df.assign(Time_Period=lambda d: d['Month'].astype(str)
                                      +'.'+d['Year'].astype(str))
             .groupby(['Name', 'Subject', 'Marks', group1, group2],
                      sort=False, as_index=False)
             ['Time_Period']
             .agg(lambda x: '-'.join([x.iloc[0], x.iloc[-1]]))
          )
    

    Output:

      Name Subject  Marks    Time_Period
    0    A    Math     80  1.2022-3.2022
    1    A    Math     70  4.2022-4.2022
    2    A    Math     80  5.2022-7.2022