Search code examples
pythonpandasdataframedatetimepandasql

Filtering and manipulating datetime in pandas dataframe


Okay, guys, I am wondering what's going. I have a pandas dataframe that I pulled from MySQL.

In fact here was my query syntax:

query = "SELECT * FROM mywebsite.com WHERE date BETWEEN '2019-12-01' AND '2020-03-31'"

websitedata = pd.read_sql(query,con=engine)

Then I exported the data as CSV. Now today reading from the CSV and trying to split the data in chunks of dates

Dec2019 = df.loc[(df.date >= "2019-12-01") & (df.date <= "2019-12-31")]
Jan2020 = df.loc[(df.date >= "2020-01-01") & (df.date <= "2020-01-31")]
Feb2020 = df.loc[(df.date >= "2020-02-01") & (df.date <= "2020-02-29")]
Mar2020 = df.loc[(df.date >= "2020-03-01") & (df.date <= "2020-03-31")]


len(df) == len(Dec2019) + len(Jan2020) + len(Feb2020) + len(Mar2020) # gives me False

Infact len(Dec2019) + len(Jan2020) + len(Feb2020) + len(Mar2020) gives 376440

And len(df) gives 384274

How can I do preview the frame to see what went wrong? Like December 2019, January 2020,..., March 2020 to understand the issue?

PS: the date is already a pandas datetime column


Solution

  • So I figured out the most efficient way to do this that gave me the exact frame without date leakages.

    Using datetime dt accessor

    Jan2020 = df[df.date.dt.month == 1]
    Dec2019 = df[df.date.dt.month == 12]
    Feb2020 = df[df.date.dt.month == 2]
    Mar2020 = df[df.date.dt.month == 3]
    

    Infact this now returns True

    len(df) == len(Dec2019) + len(Jan2020) + len(Feb2020) + len(Mar2020)
    

    Credit: How to filter a dataframe of dates by a particular month/day?