Tags: python, python-2.7, apache-spark, pyspark, udf

How to do string transformation in PySpark?


I have data like the following, and I want to transform the low column (a minutes:seconds string) to integer seconds. For example, 01:23.0 should become 1*60 + 23 = 83.

How can I do this? I tried a UDF, but it raised a Py4JJavaError:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
import pandas as pd

df = sqlContext.createDataFrame([
    ('01:23.0', 'z', 'null'),
    ('01:23.0', 'z', 'null'),
    ('01:23.0', 'c', 'null'),
    ('null', 'null', 'null'),
    ('01:24.0', 'null', '4.0')],
    ('low', 'high', 'normal'))

def min2sec(v):
    if pd.notnull(v):
        return int(v[:2]) * 60 + int(v[3:5])

udf_min2sec = udf(min2sec, IntegerType())
df.withColumn('low', udf_min2sec(df['low'])).show()
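
For context, the failure most likely comes from the rows where low is the literal string 'null': pd.notnull('null') is True, so int('nu') raises a ValueError on the executors, which Spark reports as a Py4JJavaError. A minimal sketch of a guarded version (the isdigit check is my own addition, not from the original post):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def min2sec_safe(v):
    # only parse values that actually look like mm:ss...; anything else,
    # including the literal string 'null', maps to None instead of crashing
    if v is not None and len(v) >= 5 and v[:2].isdigit() and v[3:5].isdigit():
        return int(v[:2]) * 60 + int(v[3:5])
    return None

udf_min2sec = udf(min2sec_safe, IntegerType())
df.withColumn('low', udf_min2sec(df['low'])).show()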

Solution

  • You don't need a UDF; you can use built-in functions to arrive at your expected output:

    from pyspark.sql.functions import split, col

    # split "01:23.0" on ":" into ["01", "23.0"], cast both parts to int,
    # then combine them as minutes * 60 + seconds
    df.withColumn("test", split(col("low"), ":").cast("array<int>")) \
      .withColumn("test", col("test")[0] * 60 + col("test")[1]).show()
    +-------+----+------+----+
    |    low|high|normal|test|
    +-------+----+------+----+
    |01:23.0|   z|  null|  83|
    |01:23.0|   z|  null|  83|
    |01:23.0|   c|  null|  83|
    |   null|null|  null|null|
    |01:24.0|null|   4.0|  84|
    +-------+----+------+----+
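
  • If you'd rather avoid the array cast, substring should give the same result; a sketch in the same spirit, not part of the original answer:

    from pyspark.sql.functions import substring

    # characters 1-2 hold the minutes and characters 4-5 the seconds; casting
    # a non-numeric slice such as 'nu' to int yields null, so the null row
    # passes through as null
    df.withColumn("test",
                  substring("low", 1, 2).cast("int") * 60 +
                  substring("low", 4, 2).cast("int")).show()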