Tags: python · amazon-web-services · dataframe · pyspark · date-conversion

Convert weird date string into date format in PySpark


I have a CSV file with a date column. The dates come in a strange format. Below are some examples:

  • May the 9th of 2022
  • September the 17th of 2022
  • June the 09th of 2022

I am creating a Glue job to load data into Redshift.

How do I convert these weird-looking strings into YYYY-MM-DD format using a PySpark DataFrame?


Solution

  • You can use to_date and pass the source format: "MMMM 'the' d'th of' yyyy".

    As pointed out by blackbishop, use "MMMM 'the' dd['st']['nd']['rd']['th'] 'of' yyyy" to handle all ordinal suffixes (1st, 2nd, 3rd, 4th, ...) via optional sections.

    from pyspark.sql import SparkSession, functions as func

    spark = SparkSession.builder.getOrCreate()

    spark.sparkContext.parallelize([('May the 9th of 2022',), ('September the 17th of 2022',)]).toDF(['dt_str']). \
        withColumn('dt', func.to_date('dt_str', "MMMM 'the' d'th of' yyyy")). \
        show(truncate=False)
    
    # +--------------------------+----------+
    # |dt_str                    |dt        |
    # +--------------------------+----------+
    # |May the 9th of 2022       |2022-05-09|
    # |September the 17th of 2022|2022-09-17|
    # +--------------------------+----------+
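If you want to sanity-check the conversion logic without spinning up a Spark session, the same transformation can be sketched in plain Python: strip the ordinal suffix with a regex, then parse with `datetime.strptime`. The helper name `parse_weird_date` is made up for illustration; in the actual Glue job you would use the `to_date` pattern shown above.

```python
import re
from datetime import datetime

def parse_weird_date(s: str) -> str:
    """Parse strings like 'May the 9th of 2022' into 'YYYY-MM-DD'."""
    # Drop the st/nd/rd/th suffix after the day number, e.g. '9th' -> '9'
    cleaned = re.sub(r'(\d{1,2})(st|nd|rd|th)\b', r'\1', s)
    # '%B the %d of %Y' mirrors the Spark pattern "MMMM 'the' d 'of' yyyy"
    return datetime.strptime(cleaned, '%B the %d of %Y').date().isoformat()

print(parse_weird_date('May the 9th of 2022'))         # 2022-05-09
print(parse_weird_date('September the 17th of 2022'))  # 2022-09-17
```

Because the suffix is stripped before parsing, this also handles 1st, 2nd, and 3rd, which the single-suffix Spark pattern "MMMM 'the' d'th of' yyyy" would reject.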