Tags: time, pyspark, unix-timestamp

PySpark round off timestamps to full hours?


I am interested in rounding off timestamps to full hours. What I have so far rounds to the nearest hour, for example:

df.withColumn("Full Hour", hour((round(unix_timestamp("Timestamp")/3600)*3600).cast("timestamp")))

But this "round" function uses HALF_UP rounding. This means: 23:56 results in 00:00 but I would instead prefer to have 23:00. Is this possible? I didn't find an option field how to set the rounding behaviour in the function.


Solution

  • I think you're overcomplicating things. The hour function already returns the hour component of a timestamp, which is exactly the "round down to the full hour" behaviour you want. (If you need the whole timestamp truncated rather than just the hour value, see the date_trunc sketch after the example.)

    from pyspark.sql.functions import to_timestamp, hour, unix_timestamp
    from pyspark.sql import Row
    
    df = (sc
        .parallelize([Row(Timestamp='2016_08_21 11_59_08')])
        .toDF()
        # use HH (hour of day, 0-23) rather than hh (clock hour, 1-12) so that
        # evening timestamps such as 23_56_08 parse correctly
        .withColumn("parsed", to_timestamp("Timestamp", "yyyy_MM_dd HH_mm_ss")))
    
    # hour() truncates by itself: it returns just the hour component of the timestamp
    # (hour("parsed") alone gives the same result as the round-trip below)
    df2 = df.withColumn("Full Hour", hour(unix_timestamp("parsed").cast("timestamp")))
    
    df2.show()
    

    Output:

    +-------------------+-------------------+---------+
    |          Timestamp|             parsed|Full Hour|
    +-------------------+-------------------+---------+
    |2016_08_21 11_59_08|2016-08-21 11:59:08|       11|
    +-------------------+-------------------+---------+
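
    If you need the whole timestamp rounded down to the hour rather than just the hour component, Spark also offers date_trunc (available since Spark 2.3). A minimal sketch, reusing the parsed column from above:

    from pyspark.sql.functions import date_trunc
    
    # date_trunc("hour", ...) keeps the date and hour and zeroes out minutes and seconds,
    # e.g. 2016-08-21 23:56:08 becomes 2016-08-21 23:00:00
    df3 = df.withColumn("Hour Start", date_trunc("hour", "parsed"))
    df3.show(truncate=False)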