I need to transform a dataframe with a column of timestamps in Unix time (LongType) format into an actual TimestampType.
According to epochconverter.com, the two timestamps correspond to 2022-03-08 10:38:41 and 2022-03-08 20:38:44, respectively.
However, when I use from_unixtime on the dataframe, I get a 12-hour clock, and it effectively subtracts 12 hours from my second timestamp. How can I tell PySpark to use a 24-hour clock?
The output of the code below is:
+---+----------+-------------------+
|id |mytime |mytime_new |
+---+----------+-------------------+
|ABC|1646732321|2022-03-08 10:38:41|
|DFG|1646768324|2022-03-08 08:38:44|
+---+----------+-------------------+
The second line should be 2022-03-08 20:38:44.
Reproducible code example:
from pyspark.sql.functions import from_unixtime
from pyspark.sql.types import StructType, StructField, StringType, LongType

data = [
    ("ABC", 1646732321),
    ("DFG", 1646768324),
]
schema = StructType(
    [
        StructField("id", StringType(), True),
        StructField("mytime", LongType(), True),
    ]
)
df = spark.createDataFrame(data, schema)
df = df.withColumn(
    "mytime_new",
    from_unixtime(df["mytime"], "yyyy-MM-dd hh:mm:ss"),
)
df.show(10, False)
Found my mistake 3 minutes later... the issue was my timestamp format string for the hour (hh): in Spark's datetime patterns, lowercase hh is the clock hour on a 12-hour clock (01-12), while uppercase HH is the hour of day on a 24-hour clock (00-23).
Instead of:
from_unixtime(df["mytime"], "yyyy-MM-dd hh:mm:ss"),
I needed:
from_unixtime(df["mytime"], "yyyy-MM-dd HH:mm:ss"),