Search code examples
pythonpandasdataframereplaceseries

Python pandas replace the values of a DataFrame by searching in another DataFrame with non-identical labels


I have two dataframes. df1 contains CITIES and the total number of VISITS. df2 contains the VISITS records.

Periodically df1 is updated with the data from df2 with the new VISITS.

df1 example (before updating)

ID    NAME     VISITS
---
01    CITY1    01
02    CITY2    01
...
06    CITYZ    12

df2 example

CITY     NUMBER
---
...
CITY1    01
CITY2    01    <--- highest of CITY2
CITYZ    13
CITY1    02
...
CITYZ    14
CITY1    03    <--- highest of CITY1
CITYZ    15    <--- highest of CITYZ

To update it will look for df1['NAME'] in df2['CITY'] (this is the correlation) and take the highest df2['NUMBER'] and put it in df1['VISITS'] of that CITY.

df1 after updating

ID    NAME     VISITS
---
01    CITY1    03    <--- updated
02    CITY2    01    <--- updated or not, it doesn't matter
...
06    CITYZ    15    <--- updated

my approach:

df2.loc[df2['CITY'] == 'CITYZ', 'NUMBER'].max()

I get the max number of "CITIZ" (hardcoded), but I don't know how to link it to df1.

The next is clearly wrong, but it is the idea:

df1['VISITS'] = df2.loc[df2['CITY'] == df1['NAME'], 'NUMBER'].max()

This "solution" gives the following error: ValueError: Can only compare identically-labeled Series objects


Solution

  • You can remove duplicates from df2 and keep only the last updated values (if your cities are already sorted by 'NUMBER', you can remove sort_values):

    df1['VISITS'] = df1['NAME'].map(df2.sort_values('NUMBER', ascending=True)
                                       .drop_duplicates('CITY', keep='last')
                                       .set_index('CITY')['NUMBER'])
    print(df1)
    
    # Output
       ID   NAME  VISITS
    0   1  CITY1       3
    1   2  CITY2       1
    2   6  CITYZ      15
    

    Based on @Nick's answer:

    df1['VISITS'] = df1['NAME'].map(df2.groupby('CITY')['NUMBER'].max())
    

    Input data:

    >>> df1
       ID   NAME  VISITS
    0   1  CITY1       1
    1   2  CITY2       1
    2   6  CITYZ      12
    
    >>> df2
        CITY  NUMBER
    0  CITY1       1
    1  CITY2       1
    2  CITYZ      13
    3  CITY1       2
    4  CITYZ      14
    5  CITY1       3
    6  CITYZ      15