I have two dataframes. df1 contains CITIES and the total number of VISITS. df2 contains the VISITS records.
Periodically df1 is updated with the data from df2 with the new VISITS.
df1 example (before updating)
ID NAME VISITS
---
01 CITY1 01
02 CITY2 01
...
06 CITYZ 12
df2 example
CITY NUMBER
---
...
CITY1 01
CITY2 01 <--- highest of CITY2
CITYZ 13
CITY1 02
...
CITYZ 14
CITY1 03 <--- highest of CITY1
CITYZ 15 <--- highest of CITYZ
To update it will look for df1['NAME'] in df2['CITY'] (this is the correlation) and take the highest df2['NUMBER'] and put it in df1['VISITS'] of that CITY.
df1 after updating
ID NAME VISITS
---
01 CITY1 03 <--- updated
02 CITY2 01 <--- updated or not, it doesn't matter
...
06 CITYZ 15 <--- updated
my approach:
df2.loc[df2['CITY'] == 'CITYZ', 'NUMBER'].max()
I get the max number of "CITIZ" (hardcoded), but I don't know how to link it to df1.
The next is clearly wrong, but it is the idea:
df1['VISITS'] = df2.loc[df2['CITY'] == df1['NAME'], 'NUMBER'].max()
This "solution" gives the following error: ValueError: Can only compare identically-labeled Series objects
You can remove duplicates from df2
and keep only the last updated values (if your cities are already sorted by 'NUMBER', you can remove sort_values
):
df1['VISITS'] = df1['NAME'].map(df2.sort_values('NUMBER', ascending=True)
.drop_duplicates('CITY', keep='last')
.set_index('CITY')['NUMBER'])
print(df1)
# Output
ID NAME VISITS
0 1 CITY1 3
1 2 CITY2 1
2 6 CITYZ 15
Based on @Nick's answer:
df1['VISITS'] = df1['NAME'].map(df2.groupby('CITY')['NUMBER'].max())
Input data:
>>> df1
ID NAME VISITS
0 1 CITY1 1
1 2 CITY2 1
2 6 CITYZ 12
>>> df2
CITY NUMBER
0 CITY1 1
1 CITY2 1
2 CITYZ 13
3 CITY1 2
4 CITYZ 14
5 CITY1 3
6 CITYZ 15