amazon-web-services, apache-spark, aws-glue, pyspark

AWS Glue PySpark UDF is throwing the error "An error occurred while calling o104.showString. Traceback (most recent call last)"


I need to get the expected DataFrame (shown at the end) in AWS Glue using PySpark.

    #################Initial Dataframe#################
    +---+--------------------+-------------------+
    |_c0|                 _c1|               time|
    +---+--------------------+-------------------+
    |  1|                null|2020-05-30 19:36:32|
    |  2|Mobii5              |2020-05-30 19:36:32|
    |  3|Nooft biHi ooFrame 2|2020-05-30 19:36:32|
    |  4|Samsung mobile   ...|2020-05-30 19:36:32|
    |  5|Samsung ppjomes  ...|2020-05-30 19:36:32|
    |  6| samsung GTP G Tv ne|2020-05-30 19:36:32|
    |  7| all mightyPanasoci |2020-05-30 19:36:32|
    |  8|Samsung hola       .|2020-05-30 19:36:32|
    |  9|Mooron phoines Mondo|2020-05-30 19:36:32|
    | 10|Samsung Guru .......|2020-05-30 19:36:32|
    +---+--------------------+-------------------+

Below is my code

    time_udf = udf(lambda x: year(x), IntegerType())

    timestamp = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
    new_df = df.withColumn('time', unix_timestamp(lit(timestamp), 'yyyy-MM-dd HH:mm:ss').cast("timestamp"))
    print(new_df.show(10))
    time.sleep(30)
    df1 = new_df.withColumn('Year', time_udf(col("time")))
    df1.createOrReplaceTempView("people")
    sqlDF = spark.sql("SELECT * FROM people")
    sqlDF.show()
    print(df1.printSchema())
    return(df1)

I need to get output like the one shown below, using a PySpark UDF in AWS Glue.

    ###############Expected###################
    +---+--------------------+-------------------+----+
    |_c0|                 _c1|               time|Year|
    +---+--------------------+-------------------+----+
    |  1|                null|2020-05-29 20:07:58|2020|
    |  2|Mobiistar Prime 5...|2020-05-29 20:07:58|2020|
    |  3|NTT Hikari i-Frame 2|2020-05-29 20:07:58|2020|
    |  4|Samsung SM-P605K ...|2020-05-29 20:07:58|2020|
    |  5|Samsung SM-G850W ...|2020-05-29 20:07:58|2020|
    |  6|samsung GTP G Tv ne |2020-05-29 20:07:58|2020|
    |  7|all mightyPanasoci  |2020-05-29 20:07:58|2020|
    |  8|Samsung hola       .|2020-05-29 20:07:58|2020|
    |  9|Mooron phoines Mondo|2020-05-29 20:07:58|2020|
    | 10|Samsung Guru .......|2020-05-29 20:07:58|2020|
    +---+--------------------+-------------------+----+

I can get the correct result using the line below:

    df1 = new_df.withColumn('Year',year(new_df.time))

But I need to use a UDF, as per the requirements.


Solution

  • You can't use year inside a UDF, because year is a PySpark column function: it operates on Spark columns, not on the plain Python values a UDF receives.

    If you really need to use a UDF, you can do it with the usual Python datetime functions:

        from datetime import datetime
        from pyspark.sql.functions import udf, col
        from pyspark.sql.types import IntegerType

        def extractYear(value):
            # 'time' was cast to TimestampType, so the UDF receives datetime
            # objects; parse plain strings too, in case the column is a string
            if isinstance(value, str):
                value = datetime.strptime(value, '%Y-%m-%d %H:%M:%S')
            return value.year if value is not None else None

        time_udf = udf(extractYear, IntegerType())
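
    Applying it then looks the same as in the question; a minimal sketch, assuming the new_df built above with its time column already cast to timestamp:

        # 'time' is TimestampType, so extractYear receives datetime objects
        df1 = new_df.withColumn('Year', time_udf(col("time")))
        df1.show(10)
        df1.printSchema()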
    

    But using year, as in .withColumn('Year', year(new_df.time)), is easier and quicker, so if that works for you, it's better to stick with it.
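
    For reference, a minimal end-to-end sketch of that non-UDF route, reusing the timestamp setup from the question (it assumes df with the _c0 and _c1 columns and a running SparkSession are already available):

        import datetime
        import time
        from pyspark.sql.functions import lit, unix_timestamp, year

        # add the current timestamp as a proper TimestampType column
        timestamp = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
        new_df = df.withColumn('time', unix_timestamp(lit(timestamp), 'yyyy-MM-dd HH:mm:ss').cast("timestamp"))

        # extract the year with the built-in column function, no UDF needed
        df1 = new_df.withColumn('Year', year(new_df.time))
        df1.show(10)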