Search code examples
pandasiterated-function

Row by Row update / change of Values based on 2nd Dataframe with conditions


i have got a Condition Dataframe like this which has about 300 rows

pd.DataFrame({"PERSONALNR":["000009461","000009461"],"PERIODE":["202401","202402"],"MANDANT":["LB","LB"],"DA":["01","01"]})

where "PERSONALNR" and "PERIODE" are the conditions i need to meet and the values "MANDANT" and "DA" need to be replaced in the second Dataframe

The Dataframe i want to replace values in looks similar to this one, which has about 110k rows

pd.DataFrame({"PERSONALNR":["000009461","000009461"],"PERIODE":["202401","202402"],"MANDANT":["LB","LB"],"DA":["01","01"], "KSTBEZ":["Springer pool","bla bla"]})

and the solution i came up with is the following:

for row in POOL.itertuples():
    LA.loc[(LA.PERSONALNR==row.PERSONALNR)&(LA.PERIODE==row.PERIODE)&(LA.DA=="01")&(LA.KSTBEZ.str.contains("pool")),["MANDANT","DA"]]=[row.MANDANT,row.DA]

My solution works for the Dataframe above quite ok - takes about 10 seconds or so to finish, but i need to do the same operation in a dataframe with 1 Million rows - there it takes about 10 minutes...

can anyone come up with a better solution?


Solution

  • Assuming you don't have duplicated combinations of PERSONALNR/PERIODE in POOL.

    You can use a merge and boolean indexing:

    mask = LA['DA'].eq('01') & LA['KSTBEZ'].str.contains('pool')
    tmp = (LA[['PERSONALNR', 'PERIODE']].reset_index()
           .merge(POOL, on=['PERSONALNR', 'PERIODE'], how='left')
           .set_index('index')
          )
    
    LA.loc[mask, ['MANDANT', 'DA']] = tmp.loc[mask, ['MANDANT', 'DA']]
    

    Output (using a trailing x in the MANDANT/DA values from POOL for the demo):

      PERSONALNR PERIODE MANDANT   DA         KSTBEZ
    0  000009461  202401     LBx  01x  Springer pool
    1  000009461  202402      LB   01        bla bla
    

    Why do we need reset_index/set_index?

    After the merge, the index is lost, which wouldn't allow correct alignment with the mask or for the assignment to the input LA.

    Example:

    LA = pd.DataFrame({'PERSONALNR': ['000009461', '000009461'],
                       'PERIODE': ['202401', '202402'],
                       'MANDANT': ['LB', 'LB'],
                       'DA': ['01', '01'],
                       'KSTBEZ': ['Springer pool', 'bla bla']},
                      index=[2, 0]) # Note the non-range index
    POOL = pd.DataFrame({'PERSONALNR': ['000009461', '000009461'],
                         'PERIODE': ['202401', '202402'],
                         'MANDANT': ['LBx', 'LBx'],
                         'DA': ['01x', '01x']})
    
    mask = LA['DA'].eq('01') & LA['KSTBEZ'].str.contains('pool')
    tmp = (LA[['PERSONALNR', 'PERIODE']]#.reset_index()
           .merge(POOL, on=['PERSONALNR', 'PERIODE'], how='left')
           #.set_index('index')
          )
    
    LA.loc[mask, ['MANDANT', 'DA']] = tmp.loc[mask, ['MANDANT', 'DA']]
    

    Output:

    IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).