python pandas sorting aggregate weighted-average

Calculating weighted average by sorting and aggregating in a pandas dataframe

I have a manufacturing sample dataset, it has parent batches and output batches of a product. The parent batches belong to a specific output batch within the dataset, the reason we know that is because the parent batches for each output batch has the same process order number, which is a variable. I want to calculate the weighted average, for every parent batch specific to an output batch ,so for parent batch I need to aggregate the values where the process order number is same for both parent and output batch. I want to go through each output batch with the same process number, find the parent batch with the same process number and aggregate the Quantity , take a sum for the denominator and perform the weighted function formula and store the value in another column named, "weighted feature". The other values to perform a weighted function, are already part of the dataframe which is Value.

The function to use is where Qi is Quantity and Qci is Value. Please have a look at the example diagram below, its for a specific order number and to demonstrate the various parent(input) and output batches, this may help in understanding what I am trying to do!

This is a sample dataframe to be used for the same:

import pandas as pd

A = pd.DataFrame({'Batch_ID': ['A', 'B', 'C', 'D', 'E', 'F'], 'Process_Order_Number': [1,1,1,2,2,2], 'Batch_type': ['parent', 'parent', 'output','parent', 'parent', 'output'],'Quantity': [10,20,15,5,25,50], 'Value': [2,3,1,4,0,1]})

Batch_ID	Process_Order_Number	Batch_type	Quantity	Value
A	1	parent	10	2
B	1	parent	20	3
C	1	output	15	1
D	2	parent	5	4
E	2	parent	25	0
F	2	output	50	1

I wrote a function to calculate the weighted average: distributions are Quantity and Weights are the value in the dataset above.

def weighted_average(distribution, weights):
    return round(sum([distribution[i]*weights[i] for i in range(len(distribution))])/sum(weights),2)

weighted_average(distribution, weights)

Next, I tried to aggregate the data by using the following methods but I wasn't able to get the specific cluster:

df1 = A[A.duplicated('Process_Order_Number', keep=False)].sort_values('Process_Order_Number')
df1.head()

df[A.groupby('Process_Order_Number')['Batch_type'].transform('nunique').ne(1)]

These sorted it but still didn't come in the form as shown in the picture above, I am trying to bring the same process order number parent batches together and then use my weighted function to calculate and store the value in another column, it needs to traverse through each process order number and for every batch type "output" needs to find the batch type "parent" so I can take the weighted function. I am still trying to see how I can incorporate my weighted function with the sort, so I don't have to do it separately! I did look at other stackoverflow questions but couldn't find something that would fit here. I could use some help! Any guidance is much appreciated.

Expected Output for the first three rows from the sample dataset:

Batch_ID	Process_Order_Number	Batch_type	Quantity	Value	Weighted_Average
A	1	parent	10	2
B	1	parent	20	3
C	1	output			2.66

Solution

We can first calculate the weighted_average for each process_order_number that is a parent:

mapper = df.loc[df.Batch_type.eq('parent'), :]\
           .groupby('Process_Order_Number')\
           .apply(lambda s: (s['Value'] * s['Quantity']).sum() / s['Quantity'].sum())

which yields

Process_Order_Number
1    2.666667
2    0.666667
dtype: float64

And then just assign to a new column:

df.loc[df['Batch_type'].eq('output'), 'Weighted_Average'] = df['Process_Order_Number'].map(mapper)

  Batch_ID  Process_Order_Number Batch_type  Quantity  Value  Weighted_Average
0        A                     1     parent        10      2               NaN
1        B                     1     parent        20      3               NaN
2        C                     1     output        15      1          2.666667
3        D                     2     parent         5      4               NaN
4        E                     2     parent        25      0               NaN
5        F                     2     output        50      1          0.666667

If you prefer, you can always "delete" quantity and value values, since they are blank in your expected output.

df.loc[df['Batch_type'].eq('output'), ['Quantity', 'Value']] = np.nan

  Batch_ID  Process_Order_Number Batch_type  Quantity  Value  Weighted_Average
0        A                     1     parent      10.0    2.0               NaN
1        B                     1     parent      20.0    3.0               NaN
2        C                     1     output       NaN    NaN          2.666667
3        D                     2     parent       5.0    4.0               NaN
4        E                     2     parent      25.0    0.0               NaN
5        F                     2     output       NaN    NaN          0.666667