Search code examples
pythonpandassortingaggregateweighted-average

Calculating weighted average by sorting and aggregating in a pandas dataframe


I have a manufacturing sample dataset, it has parent batches and output batches of a product. The parent batches belong to a specific output batch within the dataset, the reason we know that is because the parent batches for each output batch has the same process order number, which is a variable. I want to calculate the weighted average, for every parent batch specific to an output batch ,so for parent batch I need to aggregate the values where the process order number is same for both parent and output batch. I want to go through each output batch with the same process number, find the parent batch with the same process number and aggregate the Quantity , take a sum for the denominator and perform the weighted function formula and store the value in another column named, "weighted feature". The other values to perform a weighted function, are already part of the dataframe which is Value.

The function to use is enter image description here where Qi is Quantity and Qci is Value. Please have a look at the example diagram below, its for a specific order number and to demonstrate the various parent(input) and output batches, this may help in understanding what I am trying to do!

enter image description here

This is a sample dataframe to be used for the same:

import pandas as pd

A = pd.DataFrame({'Batch_ID': ['A', 'B', 'C', 'D', 'E', 'F'], 'Process_Order_Number': [1,1,1,2,2,2], 'Batch_type': ['parent', 'parent', 'output','parent', 'parent', 'output'],'Quantity': [10,20,15,5,25,50], 'Value': [2,3,1,4,0,1]})

Batch_ID Process_Order_Number Batch_type Quantity Value
A 1 parent 10 2
B 1 parent 20 3
C 1 output 15 1
D 2 parent 5 4
E 2 parent 25 0
F 2 output 50 1

I wrote a function to calculate the weighted average: distributions are Quantity and Weights are the value in the dataset above.

def weighted_average(distribution, weights):
    return round(sum([distribution[i]*weights[i] for i in range(len(distribution))])/sum(weights),2)

weighted_average(distribution, weights)

Next, I tried to aggregate the data by using the following methods but I wasn't able to get the specific cluster:

df1 = A[A.duplicated('Process_Order_Number', keep=False)].sort_values('Process_Order_Number')
df1.head()
df[A.groupby('Process_Order_Number')['Batch_type'].transform('nunique').ne(1)]

These sorted it but still didn't come in the form as shown in the picture above, I am trying to bring the same process order number parent batches together and then use my weighted function to calculate and store the value in another column, it needs to traverse through each process order number and for every batch type "output" needs to find the batch type "parent" so I can take the weighted function. I am still trying to see how I can incorporate my weighted function with the sort, so I don't have to do it separately! I did look at other stackoverflow questions but couldn't find something that would fit here. I could use some help! Any guidance is much appreciated.

Expected Output for the first three rows from the sample dataset:

Batch_ID Process_Order_Number Batch_type Quantity Value Weighted_Average
A 1 parent 10 2
B 1 parent 20 3
C 1 output 2.66

Solution

  • We can first calculate the weighted_average for each process_order_number that is a parent:

    mapper = df.loc[df.Batch_type.eq('parent'), :]\
               .groupby('Process_Order_Number')\
               .apply(lambda s: (s['Value'] * s['Quantity']).sum() / s['Quantity'].sum())
    

    which yields

    Process_Order_Number
    1    2.666667
    2    0.666667
    dtype: float64
    

    And then just assign to a new column:

    df.loc[df['Batch_type'].eq('output'), 'Weighted_Average'] = df['Process_Order_Number'].map(mapper)
    
      Batch_ID  Process_Order_Number Batch_type  Quantity  Value  Weighted_Average
    0        A                     1     parent        10      2               NaN
    1        B                     1     parent        20      3               NaN
    2        C                     1     output        15      1          2.666667
    3        D                     2     parent         5      4               NaN
    4        E                     2     parent        25      0               NaN
    5        F                     2     output        50      1          0.666667
    

    If you prefer, you can always "delete" quantity and value values, since they are blank in your expected output.

    df.loc[df['Batch_type'].eq('output'), ['Quantity', 'Value']] = np.nan
    

      Batch_ID  Process_Order_Number Batch_type  Quantity  Value  Weighted_Average
    0        A                     1     parent      10.0    2.0               NaN
    1        B                     1     parent      20.0    3.0               NaN
    2        C                     1     output       NaN    NaN          2.666667
    3        D                     2     parent       5.0    4.0               NaN
    4        E                     2     parent      25.0    0.0               NaN
    5        F                     2     output       NaN    NaN          0.666667