Tags: python, pandas, datetime, python-dateutil

How to Parse 0 hour with dateutil


I'm trying to merge my dataframe columns which contain time info (UTC) into a single column containing datetime object/string. The columns of my df are like this:

YY MM DD HH
98 12 05 11
98 12 05 10

So, I would like a single column containing that time information.

What I've tried so far:

I've merged the columns into strings so that I can parse them into datetime objects:

from dateutil.parser import parse
d_test = (list(df[0].map(str) + " " + df[1].map(str) + " " + df[2].map(str) 
+ " " + df[3].map(str)))

Now I just have to parse the list of date strings

parse_d = []
for d in d_test:
    parse_d.append(parse(d))

But this raises an "unknown string error". I looked into it, and it arises because some of the dates look like this:

d_test[5] = '98 12 5 0'

I've tried reading the detailed documentation of dateutil (https://labix.org/python-dateutil), and what I understood is that I have to make a dictionary specifying the timezone as key (UTC in my case), and that this might solve the error.

tzinfo ={}
parse(d_test[5], tzinfo=tzinfo)

Maybe I'm missing something very basic, but I can't figure out how to create this dictionary.


Solution

  • In general, if you know the format of a string, you don't need dateutil.parser.parse to parse it, because you can use datetime.strptime with an explicit format string.
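
As a minimal sketch of that idea, using the problematic string from the question: the format string '%y %m %d %H' is my assumption about the layout (2-digit year, month, day, hour, separated by spaces), and the UTC step assumes the values really are UTC, as the question states.

```python
from datetime import datetime, timezone

# The string that dateutil's parse() choked on in the question.
raw = '98 12 5 0'

# strptime accepts non-zero-padded day and hour values for %d and %H.
# %y needs two digits and maps 69-99 to 19xx and 00-68 to 20xx.
dt_naive = datetime.strptime(raw, '%y %m %d %H')

# Attach UTC without shifting the clock time (the input is assumed to be UTC).
dt_utc = dt_naive.replace(tzinfo=timezone.utc)

print(dt_utc.isoformat())
```

Note that %y picks the century with a fixed pivot, so this only works if that pivot suits your data.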

    The only slightly unfortunate thing here is that you have 2-digit years, some of which are from before 2000. I'd probably handle that with something like this, treating values below 50 as 20xx and everything else as 19xx:

    cent_21_mask =  df['YY'] < 50
    df.loc[:, 'YY'] = df.loc[:, 'YY'] + 1900
    df.loc[cent_21_mask, 'YY'] = df.loc[cent_21_mask, 'YY'] + 100
    

    Once you've done that, you can use one of the solutions from this question (specifically this one) to convert your individual datetime columns into pandas Timestamps / datetimes: rename the columns to year, month, day and hour, then pass them to pandas.to_datetime.

    If these are in UTC, you can then use pandas.Series.dt.tz_localize with 'UTC' to get timezone-aware datetimes.

    Putting it all together:

    import pandas as pd
    
    df = pd.DataFrame(
        [[98, 12, 5, 11],
         [98, 12, 5, 10],
         [4, 12, 5, 00]],
         columns=['YY', 'MM', 'DD', 'HH'])
    
    # Convert 2-digit years to 4-digit years
    cent_21_mask =  df['YY'] < 50
    df.loc[:, 'YY'] = df.loc[:, 'YY'] + 1900
    df.loc[cent_21_mask, 'YY'] = df.loc[cent_21_mask, 'YY'] + 100
    
    
    # Retrieve the date columns and rename them
    col_renames = {'YY': 'year', 'MM': 'month', 'DD': 'day', 'HH': 'hour'}
    dt_subset = df.loc[:, list(col_renames.keys())].rename(columns=col_renames)
    dt_series = pd.to_datetime(dt_subset)
    
    # Mark the naive timestamps as UTC (localizes; no clock shift)
    dt_series = dt_series.dt.tz_localize('UTC')
    
    # Result:
    # 0   1998-12-05 11:00:00+00:00
    # 1   1998-12-05 10:00:00+00:00
    # 2   2004-12-05 00:00:00+00:00
    # dtype: datetime64[ns, UTC]
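
For comparison, here is a hedged sketch of the format-string route applied to the whole frame at once. It rebuilds the same sample frame, zero-pads each field (an assumption on my part, since %y requires two digits), and lets pandas.to_datetime parse with an explicit format. Unlike the mask above, %y uses its own century pivot (69-99 become 19xx, 00-68 become 20xx), so the two approaches disagree for years 50 through 68.

```python
import pandas as pd

df = pd.DataFrame(
    [[98, 12, 5, 11],
     [98, 12, 5, 10],
     [4, 12, 5, 0]],
    columns=['YY', 'MM', 'DD', 'HH'])

# Zero-pad every field to two characters so '4' becomes '04', '0' becomes '00'.
parts = [df[c].astype(str).str.zfill(2) for c in ['YY', 'MM', 'DD', 'HH']]
joined = parts[0] + ' ' + parts[1] + ' ' + parts[2] + ' ' + parts[3]

# Parse with an explicit format; utc=True treats the naive values as UTC.
dt_alt = pd.to_datetime(joined, format='%y %m %d %H', utc=True)

print(dt_alt)
```

This avoids modifying the YY column in place, at the cost of accepting the fixed %y pivot.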
    

    Also, to clarify two things about this statement:

    I've tried reading the detailed documentation of dateutil (https://labix.org/python-dateutil) and what I understood is that I've to make a dictionary specifying the timezone as key (UTC in my case) and that might solve the error.

    1. The correct documentation for python-dateutil is now https://dateutil.readthedocs.io.
    2. If you are using parse, in your situation there is no reason to add UTC to a dictionary and pass it to tzinfos. If you know that your datetimes are going to be naive but that they represent times in UTC, parse them as normal to get naive datetimes, then use datetime.replace(tzinfo=dateutil.tz.tzutc()) to get aware datetimes. The tzinfos dictionary is for when the timezone information is actually represented in the string.
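
To illustrate the second point, here is a small sketch of the case tzinfos is meant for, where the string itself carries a timezone abbreviation. The input string, the abbreviation 'EST' and its offset of -18000 seconds are assumptions made up for this example; nothing like them appears in the question's data.

```python
from datetime import timedelta
from dateutil.parser import parse

# Hypothetical mapping from abbreviation to UTC offset in seconds.
tzinfos = {'EST': -18000}

# The abbreviation in the string is resolved through the tzinfos mapping.
dt_est = parse('1998-12-05 11:00 EST', tzinfos=tzinfos)

print(dt_est.utcoffset())
```

Without an abbreviation in the string there is nothing for tzinfos to look up, which is why it does not help with strings like '98 12 5 0'.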

    An example of what to do when you have strings representing UTC that don't contain timezone information:

    from dateutil.parser import parse
    from dateutil import tz
    
    dt = parse('1998-12-05 11:00')
    dt = dt.replace(tzinfo=tz.tzutc())