Search code examples
pythonpandasloopsiteration

Python Apply function to create new rows in loop


Goal: Here is a sample of a dataset that has "ID", "PHASENAME", "CDAYS", "MULTI_FACTOR", "DAY_COUNTER", and "DAILY_LABOR_PERCENT". The goal is to take every "ID", "PHASENAME", "CDAYS" and increase the "DAY_COUNTER" from whatever the last day is increase it to 100 days. I am also to repeat this for "DAILY_LABOR_PERCENT".

currently

ID PHASENAME C_DAYS Multi_Factor DAY_COUNTER DAILY_LABOR_%
BAR C 14.0 7.142857 1 1.0
BAR C 14.0 7.142857 2 5.0
BAR C 14.0 7.142857 13 4.0
BAR C 14.0 7.142857 14 0.0

need it to be

ID PHASENAME C_DAYS Multi_Factor DAY_COUNTER DAILY_LABOR_%
BAR C 14.0 7.142857 1 .14
BAR C 14.0 7.142857 2 some number
BAR C 14.0 7.142857 99 some number
BAR C 14.0 7.142857 100 0.0

In this sample dataset, this person worked for 14 days and their labor percentage is recorded for each day. Instead of have 14 days of labor, I want to have 100 days of labor ( formulas: 'DAY_COUNTER' * 'Multi_Factor' and "DAILY_LABOR_PERCENT" / 'Multi_Factor ).

Also, I'm worried that once I enlarge this dataset having the range as 100 won't work.

Looking for

  1. The Labor is repeating. So when I apply my logic I am not getting the true labor percentage.
  2. Suggestions on how to approach this with a larger dataset.

code

import pandas as pd
import numpy as np


data={
    "ID": [ "BAR","BAR","BAR","BAR","BAR","BAR","BAR","BAR","BAR","BAR","BAR","BAR","BAR","BAR"],
    "PHASENAME": [ "C","C","C","C","C","C","C","C","C","C","C","C","C","C"],
    "C_DAYS": [ 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0],
    "Multi_Factor": [7.142857, 7.142857, 7.142857, 7.142857, 7.142857, 7.142857, 7.142857, 7.142857, 7.142857, 7.142857, 7.142857, 7.142857, 7.142857, 7.142857],
    "DAY_COUNTER": [1,2,3,4,5,6,7,8,9,10,11,12,13,14],
    "DAILY_LABOR_PERCENT": [1.0,5.0,9.0,11.0,10.0,9.0,9.0,9.0,8.0,10.0,8.0,7.0,4.0,0.0],
    }

df=pd.DataFrame(data)
df1=df.copy()


n = 100
first_day = df["DAY_COUNTER"].iloc[0]

rep = np.tile(df.values, (n // len(df) + 1, 1))
out = pd.DataFrame(rep, columns=df.columns).iloc[:n]

out["DAY_COUNTER"] = range(first_day, n + first_day)


def labor_factor(row):
    return row['DAILY_LABOR_PERCENT'] / row['Multi_Factor']

#creating empty list to store rows for each id, phasename, and cdays
new_rows= []

#iterating through each index range and applying function to get day_counter to 100

for i in range(len(out)):
    out['new_labor']= out.apply(labor_factor,axis=1)


Solution

  • As a general rule, you want to avoid iterating over dataframe rows, as that's very inefficient.

    IIUC, you can instead repeat your dataframe using something vectorized like numpy.tile until you have the desired number of rows, then adjust the values accordingly:

    n = 100
    first_day = df["DAY_COUNTER"].iloc[0]
    
    rep = np.tile(df.values, (n // len(df) + 1, 1))
    out = pd.DataFrame(rep, columns=df.columns).iloc[:n]
    
    out["DAY_COUNTER"] = range(first_day, n + first_day)
    
         ID PHASENAME C_DAYS Multi_Factor  DAY_COUNTER DAILY_LABOR_PERCENT
    0   BAR         C   14.0     7.142857            1                 1.0
    1   BAR         C   14.0     7.142857            2                 5.0
    2   BAR         C   14.0     7.142857            3                 9.0
    3   BAR         C   14.0     7.142857            4                11.0
    4   BAR         C   14.0     7.142857            5                10.0
    ..  ...       ...    ...          ...          ...                 ...
    95  BAR         C   14.0     7.142857           96                 7.0
    96  BAR         C   14.0     7.142857           97                 4.0
    97  BAR         C   14.0     7.142857           98                 0.0
    98  BAR         C   14.0     7.142857           99                 1.0
    99  BAR         C   14.0     7.142857          100                 5.0