Search code examples
pythonpandasdataframelookup

Lookup based on two conditions and take the first value


I would look like to look up value in df based on conditions df_col1 <= df2_col1 and df_col2 > df2_col1. If there are multiple values, take the first one. Any hint?

import pandas as pd

df = pd.DataFrame({'day1': [3, 7, 15], 'day2': [20, 15, 11], 'value': [10, 20, 30]})

df2 = pd.DataFrame({'day': [2, 8, 15, 22]})


# find value where day1 <= day and day2 > day:
# 2 is not greater/equal [3, 7, 15], hence 0
# 8 is greater/equal [3, 7] and less than [20, 15], hence [10]
# 15 is greater/equal [3, 7, 15] and less than [20], hence [10]
# 22 is greater/equal [15] but 22 is not less than [20], hence 0


df2['value'] = 1  # [0, 10, 10, 0]

I have the following but cannot move forward.

df2['day'].apply(lambda x: df.query(f"day1 <= {x} & day2 > {x}"))

I suspect the first match is selected with idxmax(). And zero is achieved via fillna(0).


Solution

  • Perfect job for array broadcasting:

    # Convert to numpy array, and raise the array by 1 dimension to prepare for the
    # broadcasting
    day = df2["day"].to_numpy()[:, None]
    # Convert day1 and day2 to numpy arrays too
    day1, day2 = df[["day1", "day2"]].to_numpy().T
    
    # Compare every row in df1 against every row in df2
    mask = (day1 <= day) & (day < day2)
    # A match is found if any column on a row is True
    match = np.any(mask, axis=1)
    # Get the index of the first True column
    idx = np.argmax(mask, axis=1)
    
    # Convert the value column to numpy array
    value = df["value"].to_numpy()
    # When a match is found, return the corresponding value. If not, return 0
    df2["value"] = np.where(match, value[idx], 0)