Search code examples
pythonpandasdata-preprocessing

Pandas: Append copy of rows changing only values in multiple columns larger than max allowed to split bin values


Problem: I have a data frame that I need to modify based on the values of particular column. If value of any column value is greater than that of maximum allowed then a new row will be created based upon distribution into equally sized bins (taking integer division between data value and max allowed value)

Table and Explanation:

Original:

Index Data 1 Data 2 Max. Allowed
1 1 2 3
2 10 5 8
3 7 12 5

Required:

Index Values in brackets refers to the original index value

Index Data 1 Data 2 Max. Allowed
1 (1) 1 2 3
2 (2) 8 5 8
3 2 0 8
4 (3) 5 5 5
5 2 5 5
6 0 2 5

Since original index = 2, had Data1 = 10 which is greater than max allowed = 8. This row has been broken into two rows as shown in the above table.

Attempt: I was able to find those columns which had value greater than max allowed and number of rows to be inserted. But I had a confusion wether that approach would work if both columns would have value greater than the max allowed value (as in case of index = 3). The values indicate how many more rows to be inserted for each index value for particular column.

Index Data 1 Data 2
1 0 0
2 1 0
3 1 2

Solution

  • Let's approach in the following steps:

    Step 1: Preparation of split values:

    Define custom lambda function to turn Data 1, Data 2 into lists of values split with Max. Allowed if larger than it. Hold the expanded lists in 2 new columns Data 1x, Data 2x:

    f = lambda x, y, z: [z] * (x // z) + [x % z] + [0] * (max(x//z, y//z) - x//z)
    
    df['Data 1x'] = df.apply(lambda x: f(x['Data 1'], x['Data 2'], x['Max. Allowed'])  , axis=1)
    df['Data 2x'] = df.apply(lambda x: f(x['Data 2'], x['Data 1'], x['Max. Allowed'])  , axis=1)
    

    The lambda function is designed to add 0 into the lists to make the number of elements in lists in the same row to have the same lengths.

    Intermediate result:

    print(df)
    
       Index  Data 1  Data 2  Max. Allowed    Data 1x    Data 2x
    0      1       1       2             3        [1]        [2]
    1      2      10       5             8     [8, 2]     [5, 0]
    2      3       7      12             5  [5, 2, 0]  [5, 5, 2]
    

    Step 2: Explode split values into separate rows:

    Case 1: If your Pandas version is 1.3 or above

    We use DataFrame.explode() to explode the 2 new columns: (this part of feature to explode multiple columns requires Pandas version 1.3 or above)

    df = df.explode(['Data 1x', 'Data 2x'])
    

    Case 2: For Pandas version lower than 1.3, try the following way to explode:

    df = df.apply(pd.Series.explode)
    

    Case 3: If the above 2 ways to explode don't work in your programming environment, use:

    df_exp = df.explode('Data 1x')[['Index', 'Data 1', 'Data 2', 'Max. Allowed']].reset_index(drop=True)
    df_1x = df.explode('Data 1x')[['Data 1x']].reset_index(drop=True)
    df_2x = df.explode('Data 2x')[['Data 2x']].reset_index(drop=True)
    
    df = df_exp.join([df_1x, df_2x])
    

    Result:

    print(df)
    
       Index  Data 1  Data 2  Max. Allowed Data 1x Data 2x
    0      1       1       2             3       1       2
    1      2      10       5             8       8       5
    1      2      10       5             8       2       0
    2      3       7      12             5       5       5
    2      3       7      12             5       2       5
    2      3       7      12             5       0       2
    

    Step 3: Formatting to the required output:

    # select and rename columns
    df = (df[['Index', 'Data 1x',  'Data 2x', 'Max. Allowed']]
            .rename({'Data 1x': 'Data 1', 'Data 2x': 'Data 2'}, axis=1)
            .reset_index(drop=True)
         )
    
    # reset the `Index` values
    df['Index'] = df.index + 1  
    

    Final result:

    print(df)
    
    
       Index Data 1 Data 2  Max. Allowed
    0      1      1      2             3
    1      2      8      5             8
    2      3      2      0             8
    3      4      5      5             5
    4      5      2      5             5
    5      6      0      2             5