Search code examples
pythonpandasdataframedatetimeunix-timestamp

Create an integer timestamp that corresponds to a Pandas Timestamp's timezone


Suppose we have a dataset with a UNIX timestamp in milliseconds:

data = [
    {
      "unix_ts": 1669291200000,
      "val": 10
    },
    {
      "unix_ts": 1669291260000,
      "val": 25
    }
  ]

Which we convert to a Pandas dataframe with a Pandas timestamp (datetime) set to US/Eastern:

df = pd.DataFrame(data)
df['ET'] = pd.to_datetime(df['unix_ts'], unit='ms').dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
    unix_ts         val ET
0   1669291200000   10  2022-11-24 07:00:00-05:00
1   1669291260000   25  2022-11-24 07:01:00-05:00

We can see that the ET time is 5 hours behind the UTC unix_ts

Suppose we want a new integer column with a value that corresponds with that -5 hours difference. Naively, we could do this:

df['adjusted_ts'] = df['unix_ts'] - (3600000 * 5)
# Include column to allow us to check the result by eye.
df['Check_ET'] = pd.to_datetime(df['adjusted_ts'], unit='ms').dt.tz_localize('UTC').dt.tz_convert('US/Eastern')

Result:

    unix_ts         val ET                          adjusted_ts           Check_ET
0   1669291200000   10  2022-11-24 07:00:00-05:00   1669273200000           2022-11-24 02:00:00-05:00
1   1669291260000   25  2022-11-24 07:01:00-05:00   1669273260000           2022-11-24 02:01:00-05:00

We can see from the Check_ET column that this "works", but it won't when we get to the part of the year when US/Eastern is only 4 hours behind UTC.

It would be handy to be able to do something like this:

import numpy as np
df['smart_adjusted_ts'] = (df['ET'].astype(np.int64) // 10 ** 9) * 1000

But, sadly, that's not so 'smart', as it results in a column that's identical to the original unix_ts (Regardless of the ET column's timezone, the underlying data (NumPy) is always stored as nanoseconds since the EPOCH in UTC.):

    unix_ts         ... Check_ET                    smart_adjusted_ts
0   1669291200000   ... 2022-11-24 02:00:00-05:00   1669291200000
1   1669291260000   ... 2022-11-24 02:01:00-05:00   1669291260000

So, unless there is a special method to do this (Anyone?), my only thought is to go back to the original approach but dynamically extract the UTC offset (ideally as ints: 4 or 5) from the ET column.

The problem is, I can't find how to do that either, but I'm hoping to achieve something like this:

df['adjusted_ts'] = df['unix_ts'] - (3600000 * et_utc_abs_diff)

Please be aware that a dataset could include dates with both (4 & 5 hour) differences, so it's important to get this difference on a row-by-row basis as opposed to having a master variable set to 4 or 5.

Any ideas for an elegant solution, please?

Edit

I came up with the following, and it gets the right result, but I suspect there must be a better way using standard Pandas methods.

df['adjusted_ts'] = df['unix_ts'] - (df['ET'].astype(str).str.slice(start=21, stop=22).astype(int) * 3600000)

Solution

  • Here's a way to implement this by localizing to None, as I've described in the comments.

    import pandas as pd
    
    df = pd.DataFrame({"unix_ts": [1651363200000, 1669291260000],
                       "val": [10, 25]})
    
    df["ET"] = pd.to_datetime(df["unix_ts"], unit='ms', utc=True).dt.tz_convert("America/New_York")
    # df["ET"]
    # 0   2022-04-30 20:00:00-04:00
    # 1   2022-11-24 07:01:00-05:00
    # Name: ET, dtype: datetime64[ns, America/New_York]
    
    # we can remove the time zone to get naive datetime. pandas will treat this as UTC
    df["ET_naive"] = df["ET"].dt.tz_localize(None)
    # df
    #          unix_ts  val                        ET            ET_naive
    # 0  1669291200000   10 2022-11-24 07:00:00-05:00 2022-11-24 07:00:00
    # 1  1669291260000   25 2022-11-24 07:01:00-05:00 2022-11-24 07:01:00
    
    # now we can convert back to units of time since the epoch, 
    # only that the epoch is now eastern time:
    df["ET_epochtime"] = df["ET_naive"].astype("int64") / 1e6 # division gives milliseconds
    
    # df["ET_epochtime"] correctly accounts for DST offset:
    (df["unix_ts"]-df["ET_epochtime"])/3600000
    # 0    4.0
    # 1    5.0
    # dtype: float64