Search code examples
datepysparkdatabricksdate-formatdate-formatting

Validate and change the date formats in pyspark


I have a date column with different date fomrats. Now I want to validate it with a particular format('MM-dd-yyyy') and which are not matching has to be date formated to the desired one.

df = sc.parallelize([['12-21-2006'],
                     ['05/30/2007'],
                     ['01-01-1984'],
                     ['22-12-2017'],
                     ['12222019']]).toDF(["Date"])
df.show()
+----------+
|      Date|
+----------+
|12-21-2006|
|05/30/2007|
|01-01-1984|
|22-12-2017|
|  12222019|
+----------+

Now to validate,

correct=df.filter(~F.col("Date").isNotNull()|
                                           to_date(F.col("Date"),'MM-dd-yyyy').isNotNull())
correct.show()

+----------+
|      Date|
+----------+
|12-21-2006|
|01-01-1984|
+----------+

Now, I extracted wrong records which are as follows:-

wrong = df.exceptAll(correct)
wrong.show()

+----------+
|      Date|
+----------+
|05/30/2007|
|  12222019|
|22-12-2017|
+----------+

Now these wrong records has to be date formatted to the desired format which is

'MM-dd-yyyy'

If it is single I could have changed the format by specifying that particular format but how do I convert different date format into a desired date format? Is there any solution for this?


Solution

  • You could try out the different time formats in different columns and then take the first non-null value using coalesce:

    df.withColumn("d1", F.to_date(F.col("Date"),'MM-dd-yyyy')) \
      .withColumn("d2", F.to_date(F.col("Date"),'MM/dd/yyyy')) \
      .withColumn("d3", F.to_date(F.col("Date"),'dd-MM-yyyy')) \
      .withColumn("d4", F.to_date(F.col("Date"),'MMddyyyy')) \
      .withColumn("result", F.coalesce("d1", "d2", "d3", "d4")) \
      .show()
    

    Output:

    +----------+----------+----------+----------+----------+----------+
    |      Date|        d1|        d2|        d3|        d4|    result|
    +----------+----------+----------+----------+----------+----------+
    |12-21-2006|2006-12-21|      null|      null|      null|2006-12-21|
    |05/30/2007|      null|2007-05-30|      null|      null|2007-05-30|
    |01-01-1984|1984-01-01|      null|1984-01-01|      null|1984-01-01|
    |22-12-2017|      null|      null|2017-12-22|      null|2017-12-22|
    |  12222019|      null|      null|      null|2019-12-22|2019-12-22|
    +----------+----------+----------+----------+----------+----------+