I have a Pandas DataFrame with a "date" column containing datetime values, and a "months" column containing integer values representing the number of months to add to each date. I want to calculate a new date for each row that is the original "date" plus the specified number of months in the months column.
I'm using the following code:
df['date'] + df["months"].apply(lambda y: pd.DateOffset(months=y))
This code seems to work correctly, but I'm getting a PerformanceWarning that says:
PerformanceWarning: Adding/subtracting object-dtype array to DatetimeArray not vectorized
warnings.warn
I need to do this process with multiple date and months columns in my dataset and is taking a long time to perform this task.
Sample of input:
date | months | |
---|---|---|
0 | 2020-01-01 | 3 |
1 | 2021-06-15 | 6 |
2 | 2022-02-28 | 1 |
Sample of ouput:
date | months | new_date | |
---|---|---|---|
0 | 2020-01-01 | 3 | 2020-04-01 |
1 | 2021-06-15 | 6 | 2021-12-15 |
2 | 2022-02-28 | 1 | 2022-03-28 |
What does this warning mean, and how can I improve my code to avoid it? I'm concerned that this may be affecting the performance of my code, and I want to make sure that it's as efficient as possible.
You can remove the performance warning by using .dt.date
on the date column.
For faster performance you can use dateutil.relativedelta
:
from dateutil.relativedelta import relativedelta
df['new_date'] = df['date'].dt.date + df["months"].apply(lambda y: relativedelta(months=y))
print(df)
Prints (without performance warning):
date months new_date
0 2020-01-01 3 2020-04-01
1 2021-06-15 6 2021-12-15
2 2022-02-28 1 2022-03-28
Quick benchmark:
from timeit import timeit
from dateutil.relativedelta import relativedelta
def fn1(df):
return df['date'].dt.date + df["months"].apply(lambda y: pd.DateOffset(months=y))
def fn2(df):
return df['date'].dt.date + df["months"].apply(lambda y: relativedelta(months=y))
# create bigger dataframe
df = pd.concat([df]*100).reset_index(drop=True)
t1 = timeit('fn1(x)', setup='x = df.copy()', number=100, globals=globals())
t2 = timeit('fn2(x)', setup='x = df.copy()', number=100, globals=globals())
print(t1)
print(t2)
Prints:
0.6759202829562128
0.2161861159838736