We want to use dynamic, string-based SQL filters with dataframe.filter() because of constraints in our internal framework. This is tricky with time-zone-aware timestamps. Consider the following:
dataframe.filter("end_dt > '07-12-2023T00:00:00'")
I have tried a variety of techniques, including to_timestamp, cast ... as string, and others, but none of them work. Any pointers?
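For context, a minimal sketch of the failure mode (the sample data is an assumption, and this assumes ANSI mode is off, the pre-Spark-4.0 default): Spark cannot implicitly cast the literal to a timestamp, so the predicate evaluates to null and every row is silently dropped.

from datetime import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(datetime(2023, 12, 7, 9),)], ['end_dt'])

# '07-12-2023T00:00:00' is not a format Spark can implicitly cast to a
# timestamp, so the cast yields null, the predicate is null, and the
# filter returns no rows instead of raising an error.
df.filter("end_dt > '07-12-2023T00:00:00'").show()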
You can add the time zone to your timestamp literal, parse it, and then compare it to a UTC timestamp:
from datetime import datetime
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Make Spark render and parse timestamps in UTC.
spark.conf.set('spark.sql.session.timeZone', 'UTC')

# Append the zone id to the literal and parse it with the VV pattern,
# so the America/Los_Angeles offset is applied during parsing.
lit_with_tz = F.lit('07-12-2023T00:00:00' + ' America/Los_Angeles')
cutoff = F.to_timestamp(lit_with_tz, "dd-MM-yyyy'T'HH:mm:ss VV")
# Two naive datetimes, interpreted in the UTC session time zone.
_data = [
    (datetime(2023, 12, 7, 7),),
    (datetime(2023, 12, 7, 9),),
]
df = spark.createDataFrame(_data, ['end_dt'])
# Show each row next to the parsed cutoff and the comparison result.
cols = [
    'end_dt',
    cutoff.alias('cutoff'),
    (F.col('end_dt') > cutoff).alias('is_after'),
]
df.select(cols).show(10, False)
# +-------------------+-------------------+--------+
# |end_dt |cutoff |is_after|
# +-------------------+-------------------+--------+
# |2023-12-07 07:00:00|2023-12-07 08:00:00|false |
# |2023-12-07 09:00:00|2023-12-07 08:00:00|true |
# +-------------------+-------------------+--------+
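Since your framework needs the predicate as a plain string, the same parse can also be written directly in SQL and handed to filter(). This is a sketch under the same UTC session setting, reusing df from above; to_timestamp with the VV pattern is the same call as in the column version.

# Same cutoff as above, expressed as a SQL string for df.filter(...).
# Double quotes delimit the format string so the quoted 'T' survives.
sql_filter = (
    "end_dt > to_timestamp('07-12-2023T00:00:00 America/Los_Angeles', "
    "\"dd-MM-yyyy'T'HH:mm:ss VV\")"
)
df.filter(sql_filter).show(10, False)
# Keeps only the 2023-12-07 09:00:00 row.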