pyspark

How to compare a timestamp column to an arbitrary timestamp format string in pyspark?


We want to use dynamic, string/SQL-based filters with dataframe.filter() because of the needs of an internal framework. This is tricky when time zones are involved. Consider the following:

dataframe.filter("end_dt > '07-12-2023T00:00:00'")

I have tried a variety of techniques, including to_timestamp, cast ... as string, and others, but none of them work. Any pointers?


Solution

  • You can append the time zone to your timestamp literal, parse it, and then compare it against the column with the session time zone set to UTC.

    from datetime import datetime

    from pyspark.sql import functions as F

    # Interpret stored timestamps and display output in UTC.
    spark.conf.set('spark.sql.session.timeZone', 'UTC')

    # Append the source time zone to the literal and parse it with a pattern
    # that includes the zone ID (VV); the result is an instant in UTC.
    lit_with_tz = F.lit('07-12-2023T00:00:00' + ' America/Los_Angeles')
    cutoff = F.to_timestamp(lit_with_tz, "dd-MM-yyyy'T'HH:mm:ss VV")
    
    _data = [
        (datetime(2023, 12, 7, 7), ),
        (datetime(2023, 12, 7, 9), ),
    ]
    df = spark.createDataFrame(_data, ['end_dt'])
    cols = [
        'end_dt',
        cutoff.alias('cutoff'),
        (F.col('end_dt') > cutoff).alias('is_after')
    ]
    df.select(cols).show(10, False)
    
    # +-------------------+-------------------+--------+
    # |end_dt             |cutoff             |is_after|
    # +-------------------+-------------------+--------+
    # |2023-12-07 07:00:00|2023-12-07 08:00:00|false   |
    # |2023-12-07 09:00:00|2023-12-07 08:00:00|true    |
    # +-------------------+-------------------+--------+
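
  • Since the framework needs string/SQL-based filters, the same parse-then-compare idea can be written as a plain SQL expression. This is only a sketch, assuming the UTC session time zone set above and Spark's default (non-ANSI) parser, in which double-quoted literals are strings, so the single quotes around 'T' in the pattern can stay as they are:

    # Hypothetical string-based filter using the same technique; the pattern
    # is double-quoted inside the SQL string because it contains single quotes.
    sql_filter = """end_dt > to_timestamp('07-12-2023T00:00:00 America/Los_Angeles', "dd-MM-yyyy'T'HH:mm:ss VV")"""
    df.filter(sql_filter).show(10, False)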