Search code examples
pythonpandasdataframe

How can I assign a new column that is equal to the mean of the rows of another dataframe which fall in a range?


I have two Pandas dataframes. Let's call them df1 and df2. This is what df1 looks like with dummy data:

year,weight
2023,0.392
2022,0.382
2021,0.858
2020,0.135

And this is df2

start_year,end_year
2022,2023
2021,2023
2020,2023
2020,2020
2021,2022
2022,2022

The key fact is that they don't have the same shape. What I want is to create a new column in df2 which is the average of the weights in df1 that have a year between start_year and end_year. So, for example, for the first row, it would give me (0.382 + 0.392)/2=0.387. And for the whole table:

start_year,end_year,weight
2022,2023,0.387
2021,2023,0.544
2020,2023,0.44175
2020,2020,0.135
2021,2022,0.625
2022,2022,0.382

Here's a simplified version of my attempted solution to match my two dummy tables and dummy data:

df2["weight"] = df1[df1["year"].between(df2["start_year"], df2["end_year"], inclusive="both")["weight"].mean()

That results in the following error.

ValueError: Can only compare identically-labeled Series objects

I assume it's because Pandas assumes that I'm trying to compare them row-wise (ie I'm trying to check whether row 1 of df1 is between the start and end years of row 1 of df2, row 2 to row 2, and so forth) rather than doing a sort of Cartesian product.

I do understand that this could be done quite simply with a for loop, but that would turn this from a task that takes a fraction of a second into one that likely takes multiple seconds (since df1 has about 110 rows and df2 could have up to 10s of thousands of rows in the most extreme usecases). I want this to thus be a vectorized operation.


Solution

  • Vectorized approach:

    # get full range of years in df1
    year_range = np.array(range(df1["year"].min(), df1["year"].max() + 1))
    # create boolean matrix to check years in year_range that fall between start_year and end_year in df2
    m = (df2["start_year"].values <= year_range[:, None]) & (
        year_range[:, None] <= df2["end_year"].values
    )
    # align weights from df1 with year_range
    w = df1.set_index("year").reindex(year_range)["weight"].values[:, None]
    # get weighted average for each row
    df2["weight"] = (w * m).sum(axis=0) / m.sum(axis=0)
    
       start_year  end_year   weight
    0        2022      2023  0.38700
    1        2021      2023  0.54400
    2        2020      2023  0.44175
    3        2020      2020  0.13500
    4        2021      2022  0.62000
    5        2022      2022  0.38200