Okay, guys, I am wondering what's going. I have a pandas dataframe
that I pulled from MySQL
.
In fact here was my query syntax:
query = "SELECT * FROM mywebsite.com WHERE date BETWEEN '2019-12-01' AND '2020-03-31'"
websitedata = pd.read_sql(query,con=engine)
Then I exported
the data as CSV
. Now today reading
from the CSV
and trying to split the data in chunks
of dates
Dec2019 = df.loc[(df.date >= "2019-12-01") & (df.date <= "2019-12-31")]
Jan2020 = df.loc[(df.date >= "2020-01-01") & (df.date <= "2020-01-31")]
Feb2020 = df.loc[(df.date >= "2020-02-01") & (df.date <= "2020-02-29")]
Mar2020 = df.loc[(df.date >= "2020-03-01") & (df.date <= "2020-03-31")]
len(df) == len(Dec2019) + len(Jan2020) + len(Feb2020) + len(Mar2020) # gives me False
Infact len(Dec2019) + len(Jan2020) + len(Feb2020) + len(Mar2020)
gives 376440
And len(df)
gives 384274
How can I do preview the frame to see what went wrong? Like December 2019, January 2020,..., March 2020 to understand the issue?
PS: the date is already a pandas datetime
column
So I figured out the most efficient way to do this that gave me the exact frame without date leakages.
Using datetime dt accessor
Jan2020 = df[df.date.dt.month == 1]
Dec2019 = df[df.date.dt.month == 12]
Feb2020 = df[df.date.dt.month == 2]
Mar2020 = df[df.date.dt.month == 3]
Infact this now returns True
len(df) == len(Dec2019) + len(Jan2020) + len(Feb2020) + len(Mar2020)
Credit: How to filter a dataframe of dates by a particular month/day?