python-3.x, pandas, iteration, pandas-groupby, multi-index

python - Iterating over multi-index pandas dataframe


I'm trying to iterate over a huge pandas dataframe (over 370,000 rows) based on the index.

For each row, the code should look back at the last 12 entries of that index (if available) and sum them up into running quarters, half-years, and a full year.

If there is no information, or not enough (say, only the last 3 months), the code should treat the missing months / quarters as 0.
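
In other words, the trailing window of up to 12 monthly values is padded with zeros to a full year and split into four running quarters, with Q4 being the most recent. A minimal sketch of the per-row logic (a hypothetical helper, only to illustrate the windowing; the actual dataframe code follows below):

def trailing_figures(values):
    # `values` is the trailing window of up to 12 monthly amounts,
    # oldest first; months outside the available history are simply absent
    padded = [0] * (12 - len(values)) + list(values)  # treat missing months as 0
    q1, q2, q3, q4 = (sum(padded[i:i + 3]) for i in range(0, 12, 3))
    return {'Q1': q1, 'Q2': q2, 'Q3': q3, 'Q4': q4,
            'H1': q1 + q2, 'H2': q3 + q4, 'Year': q1 + q2 + q3 + q4}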

Here is a sample of my dataframe:

[image: sample input dataframe]

This is the expected output:

[image: expected output]

So looking at DateID "1" we don't have any other information for this row. DateID "1" is the last month in this case (month 12, so to speak) and therefore falls into Q4 and H2. All previous months do not exist and are therefore not considered.

I already found a working solution, but it is very inefficient and takes an unacceptably long time to run.

Here is my code sample:

import pandas as pd

records = []   # collect result rows and build the frame once at the end
               # (repeatedly appending to a DataFrame is very slow)

for company_name, c in df.groupby('Account Name'):
    # i is the 1-based position within the group; the labels yielded by
    # iterrows() do not restart per group, so enumerate() is used instead
    for i, (_, row) in enumerate(c.iterrows(), start=1):
        if i < 4:              # less than one full quarter of history
            q4 = c.iloc[:i]['Value$'].sum()
            q3 = 0
            q2 = 0
            q1 = 0
        elif 3 < i < 7:        # one to two quarters of history
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[:i-3]['Value$'].sum()
            q2 = 0
            q1 = 0
        elif 6 < i < 10:       # two to three quarters of history
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[i-6:i-3]['Value$'].sum()
            q2 = c.iloc[:i-6]['Value$'].sum()
            q1 = 0
        elif 9 < i < 13:       # three to four quarters of history
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[i-6:i-3]['Value$'].sum()
            q2 = c.iloc[i-9:i-6]['Value$'].sum()
            q1 = c.iloc[:i-9]['Value$'].sum()
        else:                  # a full year or more of history
            q4 = c.iloc[i-3:i]['Value$'].sum()
            q3 = c.iloc[i-6:i-3]['Value$'].sum()
            q2 = c.iloc[i-9:i-6]['Value$'].sum()
            q1 = c.iloc[i-12:i-9]['Value$'].sum()

        h2 = q4 + q3
        h1 = q2 + q1
        year = q4 + q3 + q2 + q1

        records.append({'Account Name': row['Account Name'], 'DateID': row['DateID'],
                        'Q4': q4, 'Q3': q3, 'Q2': q2, 'Q1': q1,
                        'H1': h1, 'H2': h2, 'Year': year})

new_df = pd.DataFrame(records)

As I said, I'm looking for a more efficient way to calculate these numbers, as I have almost 10,000 Account Names and 30 DateIDs per Account.

Thanks a lot!


Solution

  • If I got you right, this should calculate your figures:

    grouped = df.groupby('Account Name')['Value$']
    # transform (rather than apply) keeps the result aligned with the
    # original rows, so it can be assigned straight back onto df
    last_3  = grouped.transform(lambda ser: ser.rolling(window=3,  min_periods=1).sum())
    last_6  = grouped.transform(lambda ser: ser.rolling(window=6,  min_periods=1).sum())
    last_9  = grouped.transform(lambda ser: ser.rolling(window=9,  min_periods=1).sum())
    last_12 = grouped.transform(lambda ser: ser.rolling(window=12, min_periods=1).sum())

    df['Q4'] = last_3
    df['Q3'] = last_6  - last_3
    df['Q2'] = last_9  - last_6
    df['Q1'] = last_12 - last_9
    df['H1'] = df['Q1'] + df['Q2']
    df['H2'] = df['Q3'] + df['Q4']
    
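    The trick is that the running quarter sums fall out of differences of rolling sums: last_3 covers the most recent quarter (Q4), last_6 - last_3 the quarter before it (Q3), and so on, while min_periods=1 makes a partially filled window count the missing months as 0, exactly as required.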

    This outputs:

    Out[19]: 
       Account Name  DateID  Value$     Q4     Q3     Q2     Q1     H1     H2
    0             A       0      33   33.0    0.0    0.0    0.0    0.0   33.0
    1             A       1      20   53.0    0.0    0.0    0.0    0.0   53.0
    2             A       2      24   77.0    0.0    0.0    0.0    0.0   77.0
    3             A       3      21   65.0   33.0    0.0    0.0    0.0   98.0
    4             A       4      22   67.0   53.0    0.0    0.0    0.0  120.0
    5             A       5      31   74.0   77.0    0.0    0.0    0.0  151.0
    6             A       6      30   83.0   65.0   33.0    0.0   33.0  148.0
    7             A       7      23   84.0   67.0   53.0    0.0   53.0  151.0
    8             A       8      11   64.0   74.0   77.0    0.0   77.0  138.0
    9             A       9      35   69.0   83.0   65.0   33.0   98.0  152.0
    10            A      10      32   78.0   84.0   67.0   53.0  120.0  162.0
    11            A      11      31   98.0   64.0   74.0   77.0  151.0  162.0
    12            A      12      32   95.0   69.0   83.0   65.0  148.0  164.0
    13            A      13      20   83.0   78.0   84.0   67.0  151.0  161.0
    14            A      14      15   67.0   98.0   64.0   74.0  138.0  165.0
    15            B       0      44   44.0    0.0    0.0    0.0    0.0   44.0
    16            B       1      43   87.0    0.0    0.0    0.0    0.0   87.0
    17            B       2      31  118.0    0.0    0.0    0.0    0.0  118.0
    18            B       3      10   84.0   44.0    0.0    0.0    0.0  128.0
    19            B       4      13   54.0   87.0    0.0    0.0    0.0  141.0
    20            B       5      20   43.0  118.0    0.0    0.0    0.0  161.0
    21            B       6      28   61.0   84.0   44.0    0.0   44.0  145.0
    22            B       7      14   62.0   54.0   87.0    0.0   87.0  116.0
    23            B       8      20   62.0   43.0  118.0    0.0  118.0  105.0
    24            B       9      41   75.0   61.0   84.0   44.0  128.0  136.0
    25            B      10      39  100.0   62.0   54.0   87.0  141.0  162.0
    26            B      11      46  126.0   62.0   43.0  118.0  161.0  188.0
    27            B      12      26  111.0   75.0   61.0   84.0  145.0  186.0
    28            B      13      24   96.0  100.0   62.0   54.0  116.0  196.0
    29            B      14      34   84.0  126.0   62.0   43.0  105.0  210.0
    30            C       2      12   12.0    0.0    0.0    0.0    0.0   12.0
    31            C       3      15   27.0    0.0    0.0    0.0    0.0   27.0
    32            C       4      45   72.0    0.0    0.0    0.0    0.0   72.0
    33            C       5      22   82.0   12.0    0.0    0.0    0.0   94.0
    34            C       6      48  115.0   27.0    0.0    0.0    0.0  142.0
    35            C       7      45  115.0   72.0    0.0    0.0    0.0  187.0
    36            C       8      11  104.0   82.0   12.0    0.0   12.0  186.0
    37            C       9      27   83.0  115.0   27.0    0.0   27.0  198.0
    

    For the following test data:

    import pandas as pd

    data = {'Account Name': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
            'DateID': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 2, 3, 4, 5, 6, 7, 8, 9],
            'Value$': [33, 20, 24, 21, 22, 31, 30, 23, 11, 35, 32, 31, 32, 20, 15, 44, 43, 31, 10, 13, 20, 28, 14, 20, 41, 39, 46, 26, 24, 34, 12, 15, 45, 22, 48, 45, 11, 27]}

    df = pd.DataFrame(data)
    
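    The Year figure from the question is not computed above; since it is simply the trailing 12-month total, it can be added with:

    df['Year'] = last_12   # equivalently df['Q1'] + df['Q2'] + df['Q3'] + df['Q4']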

    Edit: If you want to count the unique entries over the same period, you can do that as follows:

    import numpy as np

    def get_nunique(np_array):
        # number of distinct values in the rolling window
        return len(np.unique(np_array))

    df['Category'].rolling(window=3, min_periods=1).apply(get_nunique)
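
    Note that Category is a hypothetical column here (it is not part of the test data above). As with the sums, this would presumably need to run per account rather than over the whole frame, along these lines:

    df.groupby('Account Name')['Category'].transform(
        lambda ser: ser.rolling(window=3, min_periods=1).apply(get_nunique))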