Search code examples
pythonpandassumaggregationnumeric

How to summarize values in DataFrame between defined values in column in Python Pandas?


I have DataFrame in Python Pandas like below:

Data types:

  • MONTH_NR - numeric

  • MONTH_NAME - object

  • VALUE - numeric

    MONTH_NR MONTH_NAME VALUE
    1 JANUARY 10
    2 FEBRYARY 20
    3 MARCH 15
    4 APRIL 10
    5 MAY 11
    6 JUN 100
    7 JULY 200
    8 AUGUST 12
    9 SEPTEMBER 20
    10 OCTOBER 50
    11 NOVEMBER 30
    12 DECEMBER 50

And I need to add 3 new ROWS where will be:

  1. sum of values from column "VALUE" after 6 month (from 1 till 6)
  2. sum of values from column "VALUE" after 12 month (from 7 till 12)
  3. sum of values from column "VALUE" after 12 month (from 1 till 12)

So as a result I need somethin like below:

    MONTH_NR | MONTH_NAME  | VALUE
    ---------|-------------|---------
    1        | JANUARY     |  10
    2        | FEBRYARY    |  20
    3        | MARCH       |  15
    4        | APRIL       |  10
    5        | MAY         |  11
    6        | JUN         |  100
SUM_AFTER_1_6|             |  166
    7        | JULY        |  200
    8        | AUGUST      |  12
    9        | SEPTEMBER   |  20
    10       | OCTOBER     |  50
    11       | NOVEMBER    |  30
    12       | DECEMBER    |  50
SUM_AFTER_7_12|             |  362
SUM_ALL      |             |  528

How can I do that in Python Pandas ?


Solution

  • Here a nested list aaa is created in which the name and two indexes for the range in which the sums will be turn out through loc(note that slices in loc are taken inclusive. For example df.loc[0:5, 'VALUE'] would take rows 0 through 5, that is six rows.).

    Further, in the List comprehension, a nested list is also created where the name, an empty string and the sum in each value inside the list. With np.insert, rows are inserted and the dataframe is overwritten and the original column names are set.

    import pandas as pd
    import numpy as np
    
    df = pd.read_csv('df1.csv', header=0)
    
    aaa = [['SUM_AFTER_1_6', 0, 5], ['SUM_AFTER_7_12', 6, 11], ['SUM_ALL', 0, 11]]
    bbb = [[aaa[i][0], '', df.loc[aaa[i][1]:aaa[i][2], 'VALUE'].sum()] for i in range(len(aaa))]
    
    df = pd.DataFrame(np.insert(df.values, [6, 12, 12], values=bbb, axis=0))
    
    df.rename(columns={0: 'MONTH_NR', 1: 'MONTH_NAME', 2: 'VALUE'}, inplace=True)
    
    print(df)
    

    Output

              MONTH_NR MONTH_NAME VALUE
    0                1    JANUARY    10
    1                2   FEBRYARY    20
    2                3      MARCH    15
    3                4      APRIL    10
    4                5        MAY    11
    5                6        JUN   100
    6    SUM_AFTER_1_6              166
    7                7       JULY   200
    8                8     AUGUST    12
    9                9  SEPTEMBER    20
    10              10    OCTOBER    50
    11              11   NOVEMBER    30
    12              12   DECEMBER    50
    13  SUM_AFTER_7_12              362
    14         SUM_ALL              528