Search code examples
pandassumpandas-groupby

Summarize values every N columns


From this dataframe :

|-|----|---|---|---|---|---|---|---|---|
| |code|M-1|M-2|M-3|M-4|M-5|M-6|M-7|M-8|
|-|----|---|---|---|---|---|---|---|---|
|0| DE | 3 | 0 | 5 | 7 | 0 | 2 | 1 | 9 |
|1| GT | 5 | 2 | 2 | 1 | 0 | 3 | 1 | 7 |
|2| KT | 8 | 2 | 0 | 3 | 0 | 7 | 0 | 3 |
|3| SZ | 0 | 2 | 3 | 2 | 5 | 4 | 0 | 2 |
|4| NJ | 7 | 3 | 3 | 0 | 2 | 1 | 0 | 1 |
|5| DC | 1 | 0 | 3 | 0 | 8 | 1 | 0 | 0 |
|-|----|---|---|---|---|---|---|---|---|

I would like to get that :

|-|----|-----|-----|
| |code| T-1 | T-2 |
|-|----|-----|-----|
|0| DE |  8  |  9  |
|1| GT |  9  |  4  |
|2| KT | 10  | 10  |
|3| SZ |  5  | 11  |
|4| NJ | 13  |  3  |
|5| DC |  4  |  9  |
|-|----|-----|-----|

Month-1, Month-2, Month-3 are summarized in Trimester-1.

M-4, M-5, M-6 are summarized in T-2

We lack M-9 to add the column T-3...so we deleted M-7 and M-8.

In this example, the input dataframe goes till M-8 but it could have been just till M-1 or till M-12.


Solution

  • 0. The case studied

    import pandas as pd
    
    histo = {
            "article_code" : ["DE", "GT", "KT", "SZ", "NJ", "DC"],
            "M-1" : [3, 5, 8, 0, 7, 1],
            "M-2" : [0, 2, 2, 2, 3, 0],
            "M-3" : [5, 2, 0, 3, 3, 3],
            "M-4" : [7, 1, 3, 2, 0, 0],
            "M-5" : [0, 0, 0, 5, 2, 8],
            "M-6" : [2, 3, 7, 4, 1, 1],
            "M-7" : [1, 1, 0, 0, 0, 0],
            "M-8" : [9, 7, 3, 2, 1, 0]
            }
    
    df = pd.DataFrame(histo)
    print(df)
    

    1. Method vectorized (using groupby)

    # All columns must be the months we want to group
    df.set_index("article_code", inplace=True)
    print(df)
    
    # Prepare the groupby function
    m_number = len(df.columns)
    splitter = [x//3 for x in range(0, m_number)]
    print(splitter)
    
    # Sum per trimester
    df = df.groupby(by=splitter, axis=1).sum()
    print(df)
    
    # Remove non full trimester and rename columns
    t_number = m_number//3
    df = df.iloc[:,:t_number]
    df.columns = ["T-" + str(x + 1) for x in range(0,m_number//3)]
    print(df)
    

    2. Method with a loop (using iloc)

    # Record the number of months 
    m_number = len(df.columns) - 1
    
    # Add sums per full trimester
    for inc_t, inc_m in enumerate(range(1, (m_number//3)*3, 3)):
        df["T-" + str(inc_t + 1)] = df.iloc[:,inc_m:inc_m+3:1].sum(axis=1)
    print(df)
    
    # Delete months
    df = df.iloc[:,:1].merge(right=df.iloc[:,-inc_t-1:], how="left",
                             left_index=True, right_index=True)
    print(df)