pyspark

How to compare a timestamp column to an arbitrary timestamp format string in pyspark?


We want to use dynamic, string/SQL-based filters with dataframe.filter() because of the needs of an internal framework. This is tricky when time zones are involved. Consider the following:

dataframe.filter("end_dt > '07-12-2023T00:00:00'")

I have tried a variety of techniques, including to_timestamp, cast ... as string, and others, but none of them work. Any pointers?


Solution

  • You can append the time zone to your timestamp literal, parse it, and then compare it against the column with the session time zone set to UTC.

    from datetime import datetime

    from pyspark.sql import functions as F

    # Interpret stored timestamps and display output in UTC.
    spark.conf.set('spark.sql.session.timeZone', 'UTC')

    # Append the source time zone to the literal and parse it with a pattern
    # that includes the zone ID (VV); the result is an instant in UTC.
    lit_with_tz = F.lit('07-12-2023T00:00:00' + ' America/Los_Angeles')
    cutoff = F.to_timestamp(lit_with_tz, "dd-MM-yyyy'T'HH:mm:ss VV")
    
    _data = [
        (datetime(2023, 12, 7, 7), ),
        (datetime(2023, 12, 7, 9), ),
    ]
    df = spark.createDataFrame(_data, ['end_dt'])
    cols = [
        'end_dt',
        cutoff.alias('cutoff'),
        (F.col('end_dt') > cutoff).alias('is_after')
    ]
    df.select(cols).show(10, False)
    
    # +-------------------+-------------------+--------+
    # |end_dt             |cutoff             |is_after|
    # +-------------------+-------------------+--------+
    # |2023-12-07 07:00:00|2023-12-07 08:00:00|false   |
    # |2023-12-07 09:00:00|2023-12-07 08:00:00|true    |
    # +-------------------+-------------------+--------+
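
  • Since the framework needs string/SQL-based filters, the same parse-then-compare idea can be written as a plain SQL expression. This is only a sketch, assuming the UTC session time zone set above and Spark's default (non-ANSI) parser, in which double-quoted literals are strings, so the single quotes around 'T' in the pattern can stay as they are:

    # Hypothetical string-based filter using the same technique; the pattern
    # is double-quoted inside the SQL string because it contains single quotes.
    sql_filter = """end_dt > to_timestamp('07-12-2023T00:00:00 America/Los_Angeles', "dd-MM-yyyy'T'HH:mm:ss VV")"""
    df.filter(sql_filter).show(10, False)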