amazon-web-services, apache-spark, aws-glue, pyspark

AWS Glue PySpark UDF is throwing the error "An error occurred while calling o104.showString. Traceback (most recent call last)"


I need to get the expected DataFrame (shown at the end) in AWS Glue using PySpark.

    #################Initial Dataframe#################
    +---+--------------------+-------------------+
    |_c0|                 _c1|               time|
    +---+--------------------+-------------------+
    |  1|                null|2020-05-30 19:36:32|
    |  2|Mobii5              |2020-05-30 19:36:32|
    |  3|Nooft biHi ooFrame 2|2020-05-30 19:36:32|
    |  4|Samsung mobile   ...|2020-05-30 19:36:32|
    |  5|Samsung ppjomes  ...|2020-05-30 19:36:32|
    |  6| samsung GTP G Tv ne|2020-05-30 19:36:32|
    |  7| all mightyPanasoci |2020-05-30 19:36:32|
    |  8|Samsung hola       .|2020-05-30 19:36:32|
    |  9|Mooron phoines Mondo|2020-05-30 19:36:32|
    | 10|Samsung Guru .......|2020-05-30 19:36:32|
    +---+--------------------+-------------------+

Below is my code

    time_udf = udf(lambda x: year(x), IntegerType())

    timestamp = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
    new_df = df.withColumn('time', unix_timestamp(lit(timestamp), 'yyyy-MM-dd HH:mm:ss').cast("timestamp"))
    print(new_df.show(10))
    time.sleep(30)
    df1 = new_df.withColumn('Year', time_udf(col("time")))
    df1.createOrReplaceTempView("people")
    sqlDF = spark.sql("SELECT * FROM people")
    sqlDF.show()
    print(df1.printSchema())
    return(df1)

I need to get output like the one shown below, using a PySpark UDF in AWS Glue.

    ###############Expected###################
    +---+--------------------+-------------------+----+
    |_c0|                 _c1|               time|Year|
    +---+--------------------+-------------------+----+
    |  1|                null|2020-05-29 20:07:58|2020|
    |  2|Mobiistar Prime 5...|2020-05-29 20:07:58|2020|
    |  3|NTT Hikari i-Frame 2|2020-05-29 20:07:58|2020|
    |  4|Samsung SM-P605K ...|2020-05-29 20:07:58|2020|
    |  5|Samsung SM-G850W ...|2020-05-29 20:07:58|2020|
    |  6|samsung GTP G Tv ne |2020-05-29 20:07:58|2020|
    |  7|all mightyPanasoci  |2020-05-29 20:07:58|2020|
    |  8|Samsung hola       .|2020-05-29 20:07:58|2020|
    |  9|Mooron phoines Mondo|2020-05-29 20:07:58|2020|
    | 10|Samsung Guru .......|2020-05-29 20:07:58|2020|
    +---+--------------------+-------------------+----+

I can get the correct result using the line below:

    df1 = new_df.withColumn('Year',year(new_df.time))

But I need to use a UDF, as per the requirements.


Solution

  • You can't use year inside a UDF, because year is a PySpark column function: it operates on Spark columns, not on the plain Python values a UDF receives.

    If you really need to use a UDF, you can do it with the usual Python datetime functions:

        from datetime import datetime
        from pyspark.sql.functions import udf, col
        from pyspark.sql.types import IntegerType

        def extractYear(value):
            # 'time' was cast to TimestampType, so the UDF receives datetime
            # objects; parse plain strings too, in case the column is a string
            if isinstance(value, str):
                value = datetime.strptime(value, '%Y-%m-%d %H:%M:%S')
            return value.year if value is not None else None

        time_udf = udf(extractYear, IntegerType())
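
    Applying it then looks the same as in the question; a minimal sketch, assuming the new_df built above with its time column already cast to timestamp:

        # 'time' is TimestampType, so extractYear receives datetime objects
        df1 = new_df.withColumn('Year', time_udf(col("time")))
        df1.show(10)
        df1.printSchema()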
    

    But using year, as in .withColumn('Year', year(new_df.time)), is easier and quicker, so if that works for you, it's better to stick with it.
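
    For reference, a minimal end-to-end sketch of that non-UDF route, reusing the timestamp setup from the question (it assumes df with the _c0 and _c1 columns and a running SparkSession are already available):

        import datetime
        import time
        from pyspark.sql.functions import lit, unix_timestamp, year

        # add the current timestamp as a proper TimestampType column
        timestamp = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
        new_df = df.withColumn('time', unix_timestamp(lit(timestamp), 'yyyy-MM-dd HH:mm:ss').cast("timestamp"))

        # extract the year with the built-in column function, no UDF needed
        df1 = new_df.withColumn('Year', year(new_df.time))
        df1.show(10)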