I have two Pandas dataframes. Let's call them df1
and df2
. This is what df1
looks like with dummy data:
year,weight
2023,0.392
2022,0.382
2021,0.858
2020,0.135
And this is df2
start_year,end_year
2022,2023
2021,2023
2020,2023
2020,2020
2021,2022
2022,2022
The key fact is that they don't have the same shape. What I want is to create a new column in df2
which is the average of the weights in df1
that have a year
between start_year
and end_year
. So, for example, for the first row, it would give me (0.382 + 0.392)/2=0.387. And for the whole table:
start_year,end_year,weight
2022,2023,0.387
2021,2023,0.544
2020,2023,0.44175
2020,2020,0.135
2021,2022,0.625
2022,2022,0.382
Here's a simplified version of my attempted solution to match my two dummy tables and dummy data:
df2["weight"] = df1[df1["year"].between(df2["start_year"], df2["end_year"], inclusive="both")["weight"].mean()
That results in the following error.
ValueError: Can only compare identically-labeled Series objects
I assume it's because Pandas assumes that I'm trying to compare them row-wise (ie I'm trying to check whether row 1 of df1
is between the start and end years of row 1 of df2
, row 2 to row 2, and so forth) rather than doing a sort of Cartesian product.
I do understand that this could be done quite simply with a for loop, but that would turn this from a task that takes a fraction of a second into one that likely takes multiple seconds (since df1
has about 110 rows and df2
could have up to 10s of thousands of rows in the most extreme usecases). I want this to thus be a vectorized operation.
Vectorized approach:
# get full range of years in df1
year_range = np.array(range(df1["year"].min(), df1["year"].max() + 1))
# create boolean matrix to check years in year_range that fall between start_year and end_year in df2
m = (df2["start_year"].values <= year_range[:, None]) & (
year_range[:, None] <= df2["end_year"].values
)
# align weights from df1 with year_range
w = df1.set_index("year").reindex(year_range)["weight"].values[:, None]
# get weighted average for each row
df2["weight"] = (w * m).sum(axis=0) / m.sum(axis=0)
start_year end_year weight
0 2022 2023 0.38700
1 2021 2023 0.54400
2 2020 2023 0.44175
3 2020 2020 0.13500
4 2021 2022 0.62000
5 2022 2022 0.38200