I have a time given as XXh:YYm (e.g. 1h:23m) that I'm trying to convert to seconds. The tricky part is that if it is less than an hour, the time is given as just YYm (e.g. 52m).
I am currently using:
%pyspark
from pyspark.sql.functions import col, regexp_replace, unix_timestamp

newColumn = unix_timestamp(col("time"), "H:mm")
dataF.withColumn("time", regexp_replace("time", "h|m", "")).withColumn("time", newColumn).show()
This works great for removing the h and m letters and then converting to seconds, but it returns null when the time is less than an hour, since such a value isn't actually in the H:mm format, as explained above. What's a good approach to this? Everything I try seems to overcomplicate it, and I still haven't found a solution.
I am leaning toward some sort of conditional like
if value contains 'h:' then newColumn = unix_timestamp(col("time"), "H:mm")
else newColumn = unix_timestamp(col("time"), "mm")
but I am fairly new to pyspark and not sure how to put this together. Basically, I am looking for an approach that converts a time to seconds and can handle both the '1h:23m' format and the '53m' format.
This should do the trick, assuming the time column is StringType. It uses when/otherwise to separate the two kinds of values (by checking whether the string contains 'h'), and substring to pull the minutes out of the minutes-only values.
from pyspark.sql import functions as F

df.withColumn(
    "seconds",
    # value contains 'h' -> strip the letters and parse as hours:minutes
    F.when(F.col("time").contains("h"),
           F.unix_timestamp(F.regexp_replace("time", "h|m", ""), "H:mm"))
     # otherwise it is minutes only; substring(1, 2) takes the two digits
     # before the 'm' (this assumes two-digit minutes, as in the sample data)
     .otherwise(F.unix_timestamp(F.substring("time", 1, 2), "mm"))
).show()
+------+-------+
| time|seconds|
+------+-------+
|1h:23m| 4980|
| 23m| 1380|
+------+-------+
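As a side note: if the minutes can also be a single digit (e.g. '5m'), the substring(1, 2) call will pick up the 'm' along with the digit, or if you'd rather not go through unix_timestamp at all (its result depends on the session time zone; the 4980 above assumes it is UTC), a regex-plus-arithmetic variant is an option. This is just a sketch, assuming the values always look like an optional 'Xh:' prefix followed by minutes; df and the time column are the same names as above.

from pyspark.sql import functions as F

# Extract the hour and minute digits separately; the hour group is optional.
# regexp_extract returns '' when the pattern does not match, '' casts to
# null, and coalesce turns that missing hour part into 0.
hours = F.coalesce(F.regexp_extract("time", r"(\d+)h", 1).cast("int"), F.lit(0))
minutes = F.regexp_extract("time", r"(\d+)m", 1).cast("int")

df.withColumn("seconds", hours * 3600 + minutes * 60).show()

Doing the arithmetic directly also makes the intent (hours to seconds, minutes to seconds) explicit instead of routing the values through a date parser.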