Search code examples
pythonpandaspython-dateutil

Apply a lambda function to timestamp to adjust nonsensical offset


I have a dataframe 'pages' with a timestamp field 'dimension3'. dimension3 is supposed to be ISO time but there are some nonsensical offsets causing problems.

The dataframe is pageviews of a website from visitors all over the world, so each row has it's own timestamp and timezone offset.

The offset for ISO time should be within the range -12 and +14.

Most of my timestamps fit within this range. Here is an example sound data point:

x = dateutil.parser.parse('2019-11-11T07:08:09.640-4:00')
x
datetime.datetime(2019, 11, 11, 7, 8, 9, 640000, tzinfo=tzoffset(None, -14400))

Here is an example of a problematic data point that appears in my data frame:

y = dateutil.parser.parse('2019-11-11T07:08:09.640-31:00')
y
datetime.datetime(2019, 11, 11, 7, 8, 9, 640000, tzinfo=tzoffset(None, -111600))

The problematic one has an offset of -31 which is greater than the minimun bound of -12.

This is problematic because when I try to send these data to a postgres database with a field type of timestamptz I get an error back that the data failed to upload due to some data points being outwith the acceptable bounds.

I've spent a bit of time going through the responses on this post and this one to see if there's some out of the box solution to account for these out of bounds timezone offsets.

None of the iso functions mentioned on the various packages helped me. I'm thinking I might have to apply a lambda function to each timestamp with some if_else() logic to read the offset and if the offset is below 12, then make it just 12, and if the offset is greater than 14, then just set it to 14.

How can I do that assuming that my dataframe is called pages and my timestamp column is called dimension3?

Other, more elegant solutions also welcome.


Solution

  • I used a custom function with try except within a lambda:

    def rogue_tz_offsets(t):
        """
        try to convert to timestamp and if it fails remove timezone offset
        """
        t = dateutil.parser.parse(t)
        try:
            return t.isoformat()
        except:
            t = t.replace(tzinfo = None)
            return t.isoformat()
    

    And then

    pages['dimension3'] = pages['dimension3'].apply(lambda x: rogue_tz_offsets(x))