I need to produce the expected DataFrame (shown at the end) in AWS Glue using PySpark.
#################Initial Dataframe#################
+---+--------------------+-------------------+
|_c0| _c1| time|
+---+--------------------+-------------------+
| 1| null|2020-05-30 19:36:32|
| 2|Mobii5 |2020-05-30 19:36:32|
| 3|Nooft biHi ooFrame 2|2020-05-30 19:36:32|
| 4|Samsung mobile ...|2020-05-30 19:36:32|
| 5|Samsung ppjomes ...|2020-05-30 19:36:32|
| 6| samsung GTP G Tv ne|2020-05-30 19:36:32|
| 7| all mightyPanasoci |2020-05-30 19:36:32|
| 8|Samsung hola .|2020-05-30 19:36:32|
| 9|Mooron phoines Mondo|2020-05-30 19:36:32|
| 10|Samsung Guru .......|2020-05-30 19:36:32|
+---+--------------------+-------------------+
Below is my code:
time_udf = udf(lambda x: year(x), IntegerType())
timestamp = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
new_df = df.withColumn('time',unix_timestamp(lit(timestamp),'yyyy-MM-dd HH:mm:ss').cast("timestamp"))
print(new_df.show(10))
time.sleep(30)
df1 = new_df.withColumn('Year',time_udf(col("time")))
df1.createOrReplaceTempView("people")
sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
print(df1.printSchema())
return(df1)
I need to get output like the one shown below, using a PySpark UDF in AWS Glue:
###############Expected###################
+---+--------------------+-------------------+----+
|_c0| _c1| time|Year|
+---+--------------------+-------------------+----+
| 1| null|2020-05-29 20:07:58|2020|
| 2|Mobiistar Prime 5...|2020-05-29 20:07:58|2020|
| 3|NTT Hikari i-Frame 2|2020-05-29 20:07:58|2020|
| 4|Samsung SM-P605K ...|2020-05-29 20:07:58|2020|
| 5|Samsung SM-G850W ...|2020-05-29 20:07:58|2020|
| 6|samsung GTP G Tv ne |2020-05-29 20:07:58|2020|
| 7|all mightyPanasoci |2020-05-29 20:07:58|2020|
| 8|Samsung hola .|2020-05-29 20:07:58|2020|
| 9|Mooron phoines Mondo|2020-05-29 20:07:58|2020|
| 10|Samsung Guru .......|2020-05-29 20:07:58|2020|
+---+--------------------+-------------------+----+
I can get the right result using the line below:
df1 = new_df.withColumn('Year',year(new_df.time))
But I need to use a UDF as per the requirements.
You can't use year inside a UDF, because it is a PySpark function that operates on Column expressions, not on the per-row Python values a UDF receives.
If you really need to use a UDF, you can do it with the standard Python datetime module. Note that a TimestampType column (which is what your time column is after the cast) arrives inside a Python UDF as a datetime object, not a string, so parse only when given a string:

from datetime import datetime

def extractYear(value):
    # TimestampType columns arrive in a Python UDF as datetime objects;
    # parse first only if the column is actually a string.
    if isinstance(value, str):
        value = datetime.strptime(value, '%Y-%m-%d %H:%M:%S')
    return value.year

time_udf = udf(extractYear, IntegerType())
But using year directly, as in .withColumn('Year', year(new_df.time)), is easier and quicker, so if it works, better stick to it.
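As a quick sanity check outside Spark (no Glue job needed), the year-extraction logic can be exercised with plain Python. The helper below is a standalone variant of the answer's extractYear that accepts either a string or a datetime, since a TimestampType column yields datetime objects inside a UDF:

```python
from datetime import datetime

def extractYear(value):
    # Accept either a raw string or a datetime (what a
    # TimestampType column actually yields inside a UDF).
    if isinstance(value, str):
        value = datetime.strptime(value, '%Y-%m-%d %H:%M:%S')
    return value.year

# Both input shapes yield the same year.
print(extractYear('2020-05-29 20:07:58'))             # 2020
print(extractYear(datetime(2020, 5, 29, 20, 7, 58)))  # 2020
```

Once this behaves as expected locally, wrapping it with udf(extractYear, IntegerType()) and applying it via withColumn should produce the Year column shown in the expected output.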