i have got a Condition Dataframe like this which has about 300 rows
pd.DataFrame({"PERSONALNR":["000009461","000009461"],"PERIODE":["202401","202402"],"MANDANT":["LB","LB"],"DA":["01","01"]})
where "PERSONALNR" and "PERIODE" are the conditions i need to meet and the values "MANDANT" and "DA" need to be replaced in the second Dataframe
The Dataframe i want to replace values in looks similar to this one, which has about 110k rows
pd.DataFrame({"PERSONALNR":["000009461","000009461"],"PERIODE":["202401","202402"],"MANDANT":["LB","LB"],"DA":["01","01"], "KSTBEZ":["Springer pool","bla bla"]})
and the solution i came up with is the following:
for row in POOL.itertuples():
LA.loc[(LA.PERSONALNR==row.PERSONALNR)&(LA.PERIODE==row.PERIODE)&(LA.DA=="01")&(LA.KSTBEZ.str.contains("pool")),["MANDANT","DA"]]=[row.MANDANT,row.DA]
My solution works for the Dataframe above quite ok - takes about 10 seconds or so to finish, but i need to do the same operation in a dataframe with 1 Million rows - there it takes about 10 minutes...
can anyone come up with a better solution?
Assuming you don't have duplicated combinations of PERSONALNR/PERIODE in POOL.
You can use a merge
and boolean indexing:
mask = LA['DA'].eq('01') & LA['KSTBEZ'].str.contains('pool')
tmp = (LA[['PERSONALNR', 'PERIODE']].reset_index()
.merge(POOL, on=['PERSONALNR', 'PERIODE'], how='left')
.set_index('index')
)
LA.loc[mask, ['MANDANT', 'DA']] = tmp.loc[mask, ['MANDANT', 'DA']]
Output (using a trailing x
in the MANDANT/DA values from POOL for the demo):
PERSONALNR PERIODE MANDANT DA KSTBEZ
0 000009461 202401 LBx 01x Springer pool
1 000009461 202402 LB 01 bla bla
reset_index
/set_index
?After the merge
, the index is lost, which wouldn't allow correct alignment with the mask or for the assignment to the input LA
.
Example:
LA = pd.DataFrame({'PERSONALNR': ['000009461', '000009461'],
'PERIODE': ['202401', '202402'],
'MANDANT': ['LB', 'LB'],
'DA': ['01', '01'],
'KSTBEZ': ['Springer pool', 'bla bla']},
index=[2, 0]) # Note the non-range index
POOL = pd.DataFrame({'PERSONALNR': ['000009461', '000009461'],
'PERIODE': ['202401', '202402'],
'MANDANT': ['LBx', 'LBx'],
'DA': ['01x', '01x']})
mask = LA['DA'].eq('01') & LA['KSTBEZ'].str.contains('pool')
tmp = (LA[['PERSONALNR', 'PERIODE']]#.reset_index()
.merge(POOL, on=['PERSONALNR', 'PERIODE'], how='left')
#.set_index('index')
)
LA.loc[mask, ['MANDANT', 'DA']] = tmp.loc[mask, ['MANDANT', 'DA']]
Output:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).