I imported a CSV file by parsing it with pandas, then converted it to Parquet format. The data reads Time as bigint and Date as string.

I can convert the date using the to_date function, i.e. `df.withColumn('ObservationDate', to_date('ObservationDate'))`.

However, I am struggling to change the Time column to **HH** format, and also to add the hours to the Date column. I have looked at the Spark documentation, but I cannot find what I am looking for.
Another alternative: use `lpad` to convert the hour to **HH** format, then add it with `unix_timestamp` arithmetic after setting `UTC` as the Spark session time zone.
df.show(false)
df.printSchema()
/**
* +---------------+-------------------+
* |ObservationTime|ObservationDate |
* +---------------+-------------------+
* |1 |2016-02-01T00:00:00|
* |12 |2016-02-01T00:00:00|
* +---------------+-------------------+
*
* root
* |-- ObservationTime: integer (nullable = false)
* |-- ObservationDate: string (nullable = true)
*/
spark.conf.set("spark.sql.session.timeZone", "UTC")
df.withColumn("ObservationTime", lpad($"ObservationTime", 2, "0"))
.withColumn("new_ObservationDate",
(unix_timestamp($"ObservationDate".cast("timestamp")).cast("long") + unix_timestamp($"ObservationTime", "HH").cast("long"))
.cast("timestamp")
)
.show(false)
/**
* +---------------+-------------------+-------------------+
* |ObservationTime|ObservationDate |new_ObservationDate|
* +---------------+-------------------+-------------------+
* |01 |2016-02-01T00:00:00|2016-02-01 01:00:00|
* |12 |2016-02-01T00:00:00|2016-02-01 12:00:00|
* +---------------+-------------------+-------------------+
*/
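If the zero-padded **HH** string is only needed for display, the second `unix_timestamp` parse can be skipped entirely by adding the hour offset in seconds. A minimal sketch, assuming the same column names and a Spark 2.x+ session named `spark`:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// Epoch seconds of the date plus the hour offset in seconds,
// cast back to a timestamp. 12 hours -> 12 * 3600 seconds.
df.withColumn(
    "new_ObservationDate",
    (unix_timestamp($"ObservationDate".cast("timestamp")) + $"ObservationTime" * 3600)
      .cast("timestamp")
  )
  .show(false)
```

This produces the same `new_ObservationDate` values as above while parsing the date only once; the `lpad` step remains purely cosmetic.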