python pandas dataframe datetime timedelta

PerformanceWarning when adding months to datetime values in Pandas DataFrame using apply() and DateOffset

I have a Pandas DataFrame with a "date" column containing datetime values, and a "months" column containing integer values representing the number of months to add to each date. I want to calculate a new date for each row that is the original "date" plus the specified number of months in the months column.

I'm using the following code:

df['date'] + df["months"].apply(lambda y: pd.DateOffset(months=y))

This code seems to work correctly, but I'm getting a PerformanceWarning that says:

PerformanceWarning: Adding/subtracting object-dtype array to DatetimeArray not vectorized
  warnings.warn

I need to do this process with multiple date and months columns in my dataset and is taking a long time to perform this task.

Sample of input:

	date	months
0	2020-01-01	3
1	2021-06-15	6
2	2022-02-28	1

Sample of ouput:

	date	months	new_date
0	2020-01-01	3	2020-04-01
1	2021-06-15	6	2021-12-15
2	2022-02-28	1	2022-03-28

What does this warning mean, and how can I improve my code to avoid it? I'm concerned that this may be affecting the performance of my code, and I want to make sure that it's as efficient as possible.

Solution

You can remove the performance warning by using .dt.date on the date column.

For faster performance you can use dateutil.relativedelta:

from dateutil.relativedelta import relativedelta

df['new_date'] = df['date'].dt.date + df["months"].apply(lambda y: relativedelta(months=y))
print(df)

Prints (without performance warning):

        date  months    new_date
0 2020-01-01       3  2020-04-01
1 2021-06-15       6  2021-12-15
2 2022-02-28       1  2022-03-28

Quick benchmark:

from timeit import timeit
from dateutil.relativedelta import relativedelta


def fn1(df):
    return df['date'].dt.date + df["months"].apply(lambda y: pd.DateOffset(months=y))

def fn2(df):
    return df['date'].dt.date + df["months"].apply(lambda y: relativedelta(months=y))

# create bigger dataframe
df = pd.concat([df]*100).reset_index(drop=True)

t1 = timeit('fn1(x)', setup='x = df.copy()', number=100, globals=globals())
t2 = timeit('fn2(x)', setup='x = df.copy()', number=100, globals=globals())

print(t1)
print(t2)

Prints:

0.6759202829562128
0.2161861159838736