I'm trying to merge my dataframe columns which contain time info (UTC) into a single column containing datetime object/string. The columns of my df are like this:
YY MM DD HH
98 12 05 11
98 12 05 10
So, I would like a single column containing that time information.
What I've tried so far:
I've merged the columns into strings so that I can parse them into datetime objects:
from dateutil.parser import parse
d_test = list(df[0].map(str) + " " + df[1].map(str) + " " + df[2].map(str)
              + " " + df[3].map(str))
Now I just have to parse the list of date strings
parse_d = []
for d in d_test:
    parse_d.append(parse(d))
But this raises an "Unknown string format" error. I looked into it, and it arises because some of the dates look like:
d_test[5] = '98 12 5 0'
I've tried reading the detailed documentation of dateutil (https://labix.org/python-dateutil), and what I understood is that I have to make a dictionary specifying the timezone as a key (UTC in my case), which might solve the error.
tzinfo = {}
parse(d_test[5], tzinfo=tzinfo)
Maybe I'm missing something very basic, but I'm not able to understand how to create this dictionary.
In general, if you know the format of a string, you don't need to use dateutil.parser.parse to parse it, because you can use datetime.strptime with a specified format string.
In this case, the only slightly unfortunate thing is that you have 2-digit years, some of which are from before 2000. Here, I'd probably do something like this:
cent_21_mask = df['YY'] < 50
df.loc[:, 'YY'] = df.loc[:, 'YY'] + 1900
df.loc[cent_21_mask, 'YY'] = df.loc[cent_21_mask, 'YY'] + 100
Once you've done that, you can use one of the solutions from this question (specifically this one) to convert your individual datetime columns into pandas Timestamps / datetimes.
If these are in UTC, you can then use pandas.Series.tz_localize with 'UTC' to get timezone-aware datetimes.
Putting it all together:
import pandas as pd
df = pd.DataFrame(
[[98, 12, 5, 11],
[98, 12, 5, 10],
[4, 12, 5, 00]],
columns=['YY', 'MM', 'DD', 'HH'])
# Convert 2-digit years to 4-digit years
cent_21_mask = df['YY'] < 50
df.loc[:, 'YY'] = df.loc[:, 'YY'] + 1900
df.loc[cent_21_mask, 'YY'] = df.loc[cent_21_mask, 'YY'] + 100
# Retrieve the date columns and rename them
col_renames = {'YY': 'year', 'MM': 'month', 'DD': 'day', 'HH': 'hour'}
dt_subset = df.loc[:, list(col_renames.keys())].rename(columns=col_renames)
dt_series = pd.to_datetime(dt_subset)
# Convert to UTC
dt_series = dt_series.dt.tz_localize('UTC')
# Result:
# 0 1998-12-05 11:00:00+00:00
# 1 1998-12-05 10:00:00+00:00
# 2 2004-12-05 00:00:00+00:00
# dtype: datetime64[ns, UTC]
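As a variation on the above (a sketch, not part of the original answer), you could also build strings from the columns and let pandas parse them with an explicit format. Note that %y uses Python's 69/68 century pivot rather than the &lt;50 mask used above, so this only works if that pivot suits your data:

```python
import pandas as pd

df = pd.DataFrame(
    [[98, 12, 5, 11],
     [98, 12, 5, 10]],
    columns=['YY', 'MM', 'DD', 'HH'])

# Zero-pad each column, join into strings, and parse with an explicit format
s = (df['YY'].astype(str).str.zfill(2) + ' '
     + df['MM'].astype(str).str.zfill(2) + ' '
     + df['DD'].astype(str).str.zfill(2) + ' '
     + df['HH'].astype(str).str.zfill(2))
dt_series = pd.to_datetime(s, format='%y %m %d %H').dt.tz_localize('UTC')
```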
Also, to clarify two things about this statement:
I've tried reading the detailed documentation of dateutil (https://labix.org/python-dateutil) and what I understood is that I've to make a dictionary specifying the timezone as key (UTC in my case) and that might solve the error.
The python-dateutil documentation now lives at https://dateutil.readthedocs.io. As for parse, in your situation there is no reason to add UTC into a dictionary and pass it to tzinfos.
If you know that your datetimes are going to be naive but that they represent times in UTC, parse them as normal to get naive datetimes, then use datetime.replace(tzinfo=dateutil.tz.tzutc()) to get aware datetimes. The tzinfos dictionary is for when the timezone information is actually represented in the string. An example of what to do when you have strings representing UTC that don't contain timezone information:
from dateutil.parser import parse
from dateutil import tz
dt = parse('1998-12-05 11:00')
dt = dt.replace(tzinfo=tz.tzutc())
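Conversely, here is a sketch of the case where tzinfos is actually useful: the string itself carries a timezone abbreviation the parser doesn't know. BRST is just an example abbreviation mapped to a UTC-2 offset:

```python
from dateutil.parser import parse

# Map the unknown abbreviation to a fixed UTC offset, in seconds
tzinfos = {'BRST': -7200}
dt = parse('1998-12-05 11:00 BRST', tzinfos=tzinfos)
# dt is now timezone-aware, with a UTC offset of -2 hours
```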