Search code examples
apache-sparkpysparkapache-spark-sqltimestampgregorian-calendar

Get timestamp in PySpark from GregorianCalendar


The Spark Dataframe has a column where the value is GregorianCalendar as below:

java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=?,YEAR=2017,MONTH=7,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=18,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=0,HOUR=9,HOUR_OF_DAY=9,MINUTE=1,SECOND=52,MILLISECOND=0,ZONE_OFFSET=?,DST_OFFSET=?]

How can I get the exact value of timestamp from this?
The actual value of the column is "2017-07-18 09:01:52".


Solution

  • What about regexp_replace?

    from pyspark.sql import functions as F
    df = spark.createDataFrame(
        [('java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id="Etc/UTC",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=?,YEAR=2017,MONTH=7,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=18,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=0,HOUR=9,HOUR_OF_DAY=9,MINUTE=1,SECOND=52,MILLISECOND=0,ZONE_OFFSET=?,DST_OFFSET=?]',)],
        ['GregorianCalendar']
    )
    pattern = r'(?:.*)YEAR=(\d+).+?MONTH=(\d+).+?DAY_OF_MONTH=(\d+).+?HOUR=(\d+).+?MINUTE=(\d+).+?SECOND=(\d+).+'
    df = df.withColumn('timest', F.regexp_replace('GregorianCalendar', pattern, '$1-$2-$3 $4:$5:$6').cast('timestamp'))
    df.show()
    # +--------------------+-------------------+
    # |   GregorianCalendar|             timest|
    # +--------------------+-------------------+
    # |java.util.Gregori...|2017-07-18 09:01:52|
    # +--------------------+-------------------+