Search code examples
pythonpandasdataframegroup-by

Utilizing a loop or automated method instead of manually typing out each iloc position to change dataframe value


Summary: I want to be able to recreate my function without having to manually type out each individual iloc and using if/elif for scalability for situations when the groups get too large to handle

I have a sample table df_stack_exchange

data_stack_exchange = {'store': ['A','B', 'B', 'C', 'C', 'C', 'D', 'D', 'D', 'D'],
        'worker': [1,1,2,1,2,3,1,2,3,4],
        'boxes': [105, 90, 100, 80, 10, 200, 70, 210, 50, 0],
        'optimal_boxes': [0,0,0,0,0,0,0,0,0,0]}
df_stack_exchange = pandas.DataFrame(data_stack_exchange)
store worker boxes optimal_boxes
0 A 1 105 0
1 B 1 90 0
2 B 2 100 0
3 C 1 80 0
4 C 2 10 0
5 C 3 200 0
6 D 1 70 0
7 D 2 210 0
8 D 3 50 0
9 D 4 0 0

Worker priority is in numerical order and I want to assign them a maximum of 100 boxes until there is no more to assign. The only condition is that if there is only one worker available (store A) then the single worker gets all the boxes even if it is greater than 100. See below for the expected dataframe

store worker boxes optimal_boxes
0 A 1 105 105
1 B 1 90 100
2 B 2 100 90
3 C 1 80 100
4 C 2 10 100
5 C 3 200 90
6 D 1 70 100
7 D 2 210 100
8 D 3 50 100
9 D 4 0 30

I created the following function which yields my expected outcome BUT it is not sustainable as I have to manually type out each iloc. I want to be able to recreate this function with a loop or have it be able to scale without having to keep on adding elifs. It is not a scalable solution for when group sizes become 10+ instead of the current max size of 4 (store D)

def box_optimizer(x):
    if x['optimal_boxes'].count() == 1:
        x['optimal_boxes'].iloc[0] = x['boxes'].sum()
        return x
    elif x['optimal_boxes'].count() == 2:
        x['optimal_boxes'].iloc[0] += numpy.where(x['boxes'].sum() - x['optimal_boxes'].sum() > 100, 100, x['boxes'].sum() - x['optimal_boxes'].sum())
        x['optimal_boxes'].iloc[1] += numpy.where(x['boxes'].sum() - x['optimal_boxes'].sum() > 100, 100, x['boxes'].sum() - x['optimal_boxes'].sum())
        return x
    elif x['optimal_boxes'].count() == 3:
        x['optimal_boxes'].iloc[0] += numpy.where(x['boxes'].sum() - x['optimal_boxes'].sum() > 100, 100, x['boxes'].sum() - x['optimal_boxes'].sum())
        x['optimal_boxes'].iloc[1] += numpy.where(x['boxes'].sum() - x['optimal_boxes'].sum() > 100, 100, x['boxes'].sum() - x['optimal_boxes'].sum())
        x['optimal_boxes'].iloc[2] += numpy.where(x['boxes'].sum() - x['optimal_boxes'].sum() > 100, 100, x['boxes'].sum() - x['optimal_boxes'].sum())
        return x
    elif x['optimal_boxes'].count() == 4:
        x['optimal_boxes'].iloc[0] += numpy.where(x['boxes'].sum() - x['optimal_boxes'].sum() > 100, 100, x['boxes'].sum() - x['optimal_boxes'].sum())
        x['optimal_boxes'].iloc[1] += numpy.where(x['boxes'].sum() - x['optimal_boxes'].sum() > 100, 100, x['boxes'].sum() - x['optimal_boxes'].sum())
        x['optimal_boxes'].iloc[2] += numpy.where(x['boxes'].sum() - x['optimal_boxes'].sum() > 100, 100, x['boxes'].sum() - x['optimal_boxes'].sum())
        x['optimal_boxes'].iloc[3] += numpy.where(x['boxes'].sum() - x['optimal_boxes'].sum() > 100, 100, x['boxes'].sum() - x['optimal_boxes'].sum()) 
        return x

df_stack_exchange_function = pandas.DataFrame(df_stack_exchange.groupby('store', as_index=False, group_keys=False).apply(box_optimizer))

# the expected dataframe output
df_stack_exchange_function

Solution

  • You don't need a loop, just sum the boxes, divide the total by 100 to get the number of workers with a full box, assign the rest to the last worker:

    def assign_boxes(s):
        total = s.sum()
        d = min(total // 100, len(s)-1)
        return [100]*d+[total - 100*d]+[0]*(len(s)-d-1)
    
    df['optimal_boxes'] = df.groupby('store')['boxes'].transform(assign_boxes)
    

    Output:

      store  worker  boxes  optimal_boxes
    0     A       1    105            105
    1     B       1     90            100
    2     B       2    100             90
    3     C       1     80            100
    4     C       2     10            100
    5     C       3    200             90
    6     D       1     70            100
    7     D       2    210            100
    8     D       3     50            100
    9     D       4      0             30
    

    Explanation:

    [100]*d           # number of workers with 100
    [total - 100*d]   # next worker gets all rest
    [0]*(len(s)-d-1)  # in case there are too many workers, pad with 0s
    

    Detailed explanation:

    # example 1
    s = pd.Series([70, 110, 50, 0])
    total = s.sum()  # 230
    
    # min (230//100, 3)
    # min (2, 3) = 2
    d = min(total // 100, len(s)-1)
    
    out = ([100]*d          # [100] * 2 -> [100, 100]
          +[total - 100*d]  # [230 - 100*2] -> [30]
          +[0]*(len(s)-d-1) # [0]*(4-2-1) -> [0]
          ) # [100, 100, 30, 0]
    
    # example 2
    s = pd.Series([70, 210, 50, 0])
    total = s.sum()  # 330
    
    # min (330//100, 3)
    # min (3, 3) = 3
    d = min(total // 100, len(s)-1)
    
    out = ([100]*d          # [100] * 3 -> [100, 100, 100]
          +[total - 100*d]  # [330 - 100*3] -> [30]
          +[0]*(len(s)-d-1) # [0]*(4-3-1) -> []
          ) # [100, 100, 100, 30]