for me a Python fresh, I m working on a code for building bill of materials(BOM), which can get the customer's desired item id and purchase quantity from the order sheet in the excel file and use another BOM sheet(also in the same excel) to calculate the quantity of all raw materials needed.After subtracting the stock of raw materials from the inventory, the remaining demand is input to the dictionary output like{id_material: quantity}. The format of BOM sheet as follow
item_id | process_id | process_No. | IN/OUT | material_id | quantity_in | quantity_out |
---|---|---|---|---|---|---|
A | z420 | 1 | IN | 12125 | 100 | Nan |
A | z420 | 1 | OUT | A-z512-2 | Nan | 100 |
A | z512 | 2 | IN | A-z512-2 | 100 | Nan |
A | z512 | 2 | OUT | A-z600-3 | Nan | 120 |
A | z600 | 3 | IN | A-z600-3 | 120 | Nan |
A | z600 | 3 | OUT | 14551 | Nan | -20 |
A | z600 | 3 | OUT | A | Nan | 100 |
attr: processs_id : Process id used
attr: processs_No.: Process order in the process path. Not always consecutive or regular like natural numbers, such as (51, 60, 70, 100)
attr: IN/OUT : Indicate whether the material is a raw material or an output
the pandas.dataframe I use is also like this but two attr columns added: 'count_demand' for indicate the quantity needed and 'flag' for my function to identify the material that needs to be executed.Lets call it 'df_demand'.
I was able to complete the functions that would serve my purpose, but the speed was not satisfactory. I tested with moudles such as timeit and found some operations that took a lot of time, but I couldn't think of a way to optimize them, so I came here for help.
def calculate_demand_raw(row, df_demand):
try:
if np.isnan(row['quantity_out']):
raise ValueError('To avoid including those recycled materials with negative outputs')
list_index = list(df_demand['item_id'].isin([row['item_id']]) &
df_demand['process_No.'].isin([row['process_No.']]) &
df_demand['IN/OUT'].isin(['IN']))
index = [i for i, x in enumerate(list_index) if x==True]
# Search to find the index of the required generation process
df_demand.loc[index, 'count_demand'] = row['count_demand']/row['quantity_out']*
df_demand.loc[index, 'quantity_in']
# calculate quantity of raw materials.
df_demand.loc[index, 'flag'] = 1
except ValueError:
pass # Prevent the query material is the base material, no process generation
df_demand.loc[row.name, 'flag'] = 0
df_demand[df_demand['flag'].isin([1])].apply(lambda row: calculate_demand_raw(row, df_demand), axis=1)
timeit told me that, in the function, it takes three times as long to find the index of the eligible rows as it does to calculate the quantity of the raw material, and calculate_demand_raw is also the most time-consuming function in a loop.So could anyone have a deal of search index time reduction?
def fill_demand(row, qty_sum_demand, df_demand):
df_demand[row.name, 'count_demand'] += qty_sum_demand.loc[
qty_sum_demand['IN/OUT'].isin([row['IN/OUT']).tolist(),
'count_demand'].tolist()
df_demand.loc[index, 'flag'] = 1
df_demand.loc[index_generated_process].apply(lambda row:
fill_demand(row, qty_sum_demand, df_demand), axis=1)
Is it the conditional search in the function that makes it take so long, just like calculate_demand_raw? Is it possible to turn this operation into a faster numpy vectorization operation?
A modified version of your example to show some functionality:
df = pd.read_csv(io.StringIO(
"""
item_id,process_id,process_No.,IN/OUT,material_id,quantity_in,quantity_out,flag
A,z420,1,IN,12125,100.0,,0
A,z420,1,OUT,A-z512-2,,100.0,0
A,z512,2,IN,A-z512-2,100.0,,0
A,z512,2,OUT,A-z600-3,,120.0,0
A,z600,3,IN,A-z600-2,,400,1
A,z600,3,IN,A-z600-3,120.0,200,1
A,z600,3,IN,14551,,-20.0,0
A,z600,3,OUT,A,,100.0,0
""".strip()
))
item_id process_id process_No. IN/OUT material_id quantity_in quantity_out flag
0 A z420 1 IN 12125 100.0 NaN 0
1 A z420 1 OUT A-z512-2 NaN 100.0 0
2 A z512 2 IN A-z512-2 100.0 NaN 0
3 A z512 2 OUT A-z600-3 NaN 120.0 0
4 A z600 3 IN A-z600-2 NaN 400.0 1
5 A z600 3 IN A-z600-3 120.0 200.0 1 # <- only valid flag (non-na quantity_in)
6 A z600 3 IN 14551 NaN -20.0 0
7 A z600 3 OUT A NaN 100.0 0
Here's one way you may implement calculate_demand_raw
without using .apply
+ .loc
lookups.
Generally the you want to .merge
in cases like this so you have all the data "side-by-side" which allows you to then work in a "vectorized manner".
flags = df[df['flag'] == 1].dropna(subset='quantity_in')
df_m = df.merge(flags, on=['item_id', 'process_No.'], how='left', suffixes=('', '_y'))
df_m.loc[
(df_m['flag'] == 1) | (df_m['IN/OUT'] == 'OUT'),
df.columns.difference(['item_id', 'process_No.']) + '_y'
] = float('nan')
rows = df_m['process_id_y'].notna()
df_m.loc[rows, 'quantity_out'] *= df_m.loc[rows, 'quantity_in_y']
df_m.loc[df_m['flag'] == 1, 'flag'] = 0
df_m.loc[rows, 'flag'] = 1
A breakdown of the steps:
Find all flag rows.
flags = df[df['flag'] == 1].dropna(subset='quantity_in')
The .dropna()
is to emulate the if np.isnan(row['quantity_out'])
line in your code.
item_id process_id process_No. IN/OUT material_id quantity_in quantity_out flag
5 A z600 3 IN A-z600-3 120.0 200.0 1
left-merge with flags:
df_m = df.merge(flags, on=['item_id', 'process_No.'], how='left', suffixes=('', '_y'))
item_id process_id process_No. IN/OUT material_id quantity_in quantity_out flag process_id_y IN/OUT_y material_id_y quantity_in_y quantity_out_y flag_y
0 A z420 1 IN 12125 100.0 NaN 0 NaN NaN NaN NaN NaN NaN
1 A z420 1 OUT A-z512-2 NaN 100.0 0 NaN NaN NaN NaN NaN NaN
2 A z512 2 IN A-z512-2 100.0 NaN 0 NaN NaN NaN NaN NaN NaN
3 A z512 2 OUT A-z600-3 NaN 120.0 0 NaN NaN NaN NaN NaN NaN
4 A z600 3 IN A-z600-2 NaN 400.0 1 z600 IN A-z600-3 120.0 200.0 1.0
5 A z600 3 IN A-z600-3 120.0 200.0 1 z600 IN A-z600-3 120.0 200.0 1.0
6 A z600 3 IN 14551 NaN -20.0 0 z600 IN A-z600-3 120.0 200.0 1.0
7 A z600 3 OUT A NaN 100.0 0 z600 IN A-z600-3 120.0 200.0 1.0
You want to discard OUT
rows, it's unclear if you want to compare the flag rows with themselves so I've discarded them here.
You can reset the _y
columns on the rows you want to discard back to NaN
df_m.loc[
(df_m['flag'] == 1) | (df_m['IN/OUT'] == 'OUT'),
df.columns.difference(['item_id', 'process_No.']) + '_y'
] = float('nan')
item_id process_id process_No. IN/OUT material_id quantity_in quantity_out flag process_id_y IN/OUT_y material_id_y quantity_in_y quantity_out_y flag_y
0 A z420 1 IN 12125 100.0 NaN 0 NaN NaN NaN NaN NaN NaN
1 A z420 1 OUT A-z512-2 NaN 100.0 0 NaN NaN NaN NaN NaN NaN
2 A z512 2 IN A-z512-2 100.0 NaN 0 NaN NaN NaN NaN NaN NaN
3 A z512 2 OUT A-z600-3 NaN 120.0 0 NaN NaN NaN NaN NaN NaN
4 A z600 3 IN A-z600-2 NaN 400.0 1 NaN NaN NaN NaN NaN NaN
5 A z600 3 IN A-z600-3 120.0 200.0 1 NaN NaN NaN NaN NaN NaN
6 A z600 3 IN 14551 NaN -20.0 0 z600 IN A-z600-3 120.0 200.0 1.0
7 A z600 3 OUT A NaN 100.0 0 NaN NaN NaN NaN NaN NaN
You can then perform your calculations on the rows with nonna _y
values:
rows = df_m['process_id_y'].notna()
df_m.loc[rows, 'quantity_out'] *= df_m.loc[rows, 'quantity_in_y']
item_id process_id process_No. IN/OUT material_id quantity_in quantity_out flag process_id_y IN/OUT_y material_id_y quantity_in_y quantity_out_y flag_y
0 A z420 1 IN 12125 100.0 NaN 0 NaN NaN NaN NaN NaN NaN
1 A z420 1 OUT A-z512-2 NaN 100.0 0 NaN NaN NaN NaN NaN NaN
2 A z512 2 IN A-z512-2 100.0 NaN 0 NaN NaN NaN NaN NaN NaN
3 A z512 2 OUT A-z600-3 NaN 120.0 0 NaN NaN NaN NaN NaN NaN
4 A z600 3 IN A-z600-2 NaN 400.0 1 NaN NaN NaN NaN NaN NaN
5 A z600 3 IN A-z600-3 120.0 200.0 1 NaN NaN NaN NaN NaN NaN
6 A z600 3 IN 14551 NaN -2400.0 0 z600 IN A-z600-3 120.0 200.0 1.0
7 A z600 3 OUT A NaN 100.0 0 NaN NaN NaN NaN NaN NaN
Toggle the flag values:
df_m.loc[df_m['flag'] == 1, 'flag'] = 0
df_m.loc[rows, 'flag'] = 1
item_id process_id process_No. IN/OUT material_id quantity_in quantity_out flag process_id_y IN/OUT_y material_id_y quantity_in_y quantity_out_y flag_y
0 A z420 1 IN 12125 100.0 NaN 0 NaN NaN NaN NaN NaN NaN
1 A z420 1 OUT A-z512-2 NaN 100.0 0 NaN NaN NaN NaN NaN NaN
2 A z512 2 IN A-z512-2 100.0 NaN 0 NaN NaN NaN NaN NaN NaN
3 A z512 2 OUT A-z600-3 NaN 120.0 0 NaN NaN NaN NaN NaN NaN
4 A z600 3 IN A-z600-2 NaN 400.0 0 NaN NaN NaN NaN NaN NaN
5 A z600 3 IN A-z600-3 120.0 200.0 0 NaN NaN NaN NaN NaN NaN
6 A z600 3 IN 14551 NaN -2400.0 1 z600 IN A-z600-3 120.0 200.0 1.0
7 A z600 3 OUT A NaN 100.0 0 NaN NaN NaN NaN NaN NaN