Search code examples
pythonpandasdatetimetimedeltarelativedelta

Not all dates are captured when filtering by dates. Python Pandas


I am filtering a dataframe by dates to produce two seperate versions:

  1. Data from only today's date
  2. Data from the last two years

However, when I try to filter on the date, it seems to miss dates that are within the last two years.

date_format = '%m-%d-%Y'  # desired date format

today = dt.now().strftime(date_format)  # today's date. Will always result in today's date
today = dt.strptime(today, date_format).date()  # converting 'today' into a datetime object

today = today.strftime(date_format)
two_years = today - relativedelta(years=2)  # date is today's date minus two years. 
two_years = two_years.strftime(date_format)

# normalizing the format of the date column to the desired format 
df_data['date'] = pd.to_datetime(df_data['date'], errors='coerce').dt.strftime(date_format)

df_today = df_data[df_data['date'] == today]
df_two_year = df_data[df_data['date'] >= two_years]

Which results in:

all dates ['07-17-2020' '07-15-2020' '08-01-2019' '03-25-2015']
today df ['07-17-2020']
two year df ['07-17-2020' '08-01-2019']

The 07-15-2020 date is missing from the two year, even though 08-01-2019 is captured.


Solution

  • you don't need to convert anything to string, simply work with datetime dtype. Ex:

    import pandas as pd
    
    df = pd.DataFrame({'date': pd.to_datetime(['07-17-2020','07-15-2020','08-01-2019','03-25-2015'])})
    
    today = pd.Timestamp('now')
    
    print(df[df['date'].dt.date == today.date()])
    #         date
    # 0 2020-07-17
    
    print(df[(df['date'].dt.year >= today.year-1) & (df['date'].dt.date != today.date())])
    #         date
    # 1 2020-07-15
    # 2 2019-08-01
    

    What you get from the comparison operations (adjust them as needed...) are boolean masks - you can use them nicely to filter the df.