Tags: pyspark, databricks

PySpark datetime patterns with day-of-week


I am having trouble parsing datetime strings that contain a day-of-week with the to_timestamp function.

According to the Spark documentation, the string Fri, 23 Aug 2024 12:11:16 GMT should be parsed with the pattern EEE, dd MMM yyyy HH:mm:ss 'GMT'; however, I am getting a SparkUpgradeException:

SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.DATETIME_PATTERN_RECOGNITION] You may get a different result due to the upgrading to Spark >= 3.0:
Fail to recognize 'EEE, dd MMM yyyy HH:mm:ss \'GMT\'' pattern in the DateTimeFormatter.
1) You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0.
2) You can form a valid datetime pattern with the guide from 'https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html'.

The problem seems to be the day-of-week part of the string, because the pattern dd MMM yyyy HH:mm:ss 'GMT' works fine for 23 Aug 2024 12:11:16 GMT.
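
A minimal reproduction of both cases (the one-row DataFrame and column names are illustrative, and an active SparkSession named spark is assumed, as in a Databricks notebook):

    from pyspark.sql.functions import to_timestamp

    df = spark.createDataFrame(
        [("Fri, 23 Aug 2024 12:11:16 GMT", "23 Aug 2024 12:11:16 GMT")],
        ["with_dow", "without_dow"],
    )
    # Raises SparkUpgradeException because of the EEE token:
    df.select(to_timestamp("with_dow", "EEE, dd MMM yyyy HH:mm:ss 'GMT'")).show()
    # Parses fine once the day-of-week is gone:
    df.select(to_timestamp("without_dow", "dd MMM yyyy HH:mm:ss 'GMT'")).show()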

Is there a way to parse this datetime string without setting the LEGACY flag?

Spark version: 3.5.0; runtime: Databricks 14.3


Solution

  • The problem arises because the 'E' symbol cannot be used for timestamp parsing:

    Symbols of ‘E’, ‘F’, ‘q’ and ‘Q’ can only be used for datetime formatting, e.g. date_format. They are not allowed used for datetime parsing, e.g. to_timestamp.

    Since the day-of-week adds no information to the timestamp, you can instead remove it with a regex before parsing:

    df = df.withColumn("col_cleaned", regexp_replace("col", "^[A-Za-z]{3}, ", ""))
    
    df = df.withColumn("col_parsed", to_timestamp("col_cleaned", "dd MMM yyyy HH:mm:ss 'GMT'"))