The Spark Dataframe has a column where the value is GregorianCalendar as below:
java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=?,YEAR=2017,MONTH=7,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=18,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=0,HOUR=9,HOUR_OF_DAY=9,MINUTE=1,SECOND=52,MILLISECOND=0,ZONE_OFFSET=?,DST_OFFSET=?]
How can I get the exact value of timestamp from this?
The actual value of the column is "2017-07-18 09:01:52".
What about regexp_replace
?
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=?,YEAR=2017,MONTH=7,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=18,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=0,HOUR=9,HOUR_OF_DAY=9,MINUTE=1,SECOND=52,MILLISECOND=0,ZONE_OFFSET=?,DST_OFFSET=?]',)],
['GregorianCalendar']
)
pattern = r'(?:.*)YEAR=(\d+).+?MONTH=(\d+).+?DAY_OF_MONTH=(\d+).+?HOUR=(\d+).+?MINUTE=(\d+).+?SECOND=(\d+).+'
df = df.withColumn('timest', F.regexp_replace('GregorianCalendar', pattern, '$1-$2-$3 $4:$5:$6').cast('timestamp'))
df.show()
# +--------------------+-------------------+
# | GregorianCalendar| timest|
# +--------------------+-------------------+
# |java.util.Gregori...|2017-07-18 09:01:52|
# +--------------------+-------------------+