Tags: dataframe, pyspark, date-format

How to validate the date format of a column in PySpark?


I am really new to PySpark. I want to check whether a column is in the correct date format. How do I do that? I have tried the code below, but I am getting an error. Can anyone help me with this?

My code:

df = 
   Date        name
0  12/12/2020   a
1  24/01/2019   b
2  08/09/2018   c
3  12/24/2020   d
4  Nan          e
df_out= df.withColumn('output', F.when(F.to_date("Date","dd/mm/yyyy").isNotNull, Y).otherwise(No))
df_out.show()

gives me:

TypeError: condition should be a Column

Solution

  • You can convert the column to date type with the expected format; rows that fail to parse come back as null, so filtering on isNotNull keeps only the valid dates.

    Example:

    df.show()
    #+----------+----+
    #|      Date|name|
    #+----------+----+
    #|12/12/2020|   a|
    #|24/01/2019|   b|
    #|12/24/2020|   d|
    #|       nan|   e|
    #+----------+----+
    
    from pyspark.sql.functions import *
    
    df.withColumn("output", to_date(col('Date'), 'dd/MM/yyyy')) \
        .filter(col("output").isNotNull()) \
        .show()
    #+----------+----+----------+
    #|      Date|name|    output|
    #+----------+----+----------+
    #|12/12/2020|   a|2020-12-12|
    #|24/01/2019|   b|2019-01-24|
    #+----------+----+----------+
    
    #without adding a new column
    df.filter(to_date(col('Date'),'dd/MM/yyyy').isNotNull()).show()
    #+----------+----+
    #|      Date|name|
    #+----------+----+
    #|12/12/2020|   a|
    #|24/01/2019|   b|
    #+----------+----+
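
  • If you want the Y/No flag column from the question instead of filtering: the original error happens because isNotNull is missing its parentheses, so when() receives the bound method itself rather than a boolean Column, which raises TypeError: condition should be a Column. The Y and No values also need to be literals rather than bare names, and the format must use MM for months (lowercase mm means minutes). A minimal corrected sketch of that approach:

    from pyspark.sql import functions as F

    df_out = df.withColumn(
        "output",
        F.when(
            F.to_date(F.col("Date"), "dd/MM/yyyy").isNotNull(),  # note the (): call the method
            F.lit("Y"),                                          # literal value, not a bare name
        ).otherwise(F.lit("No")),
    )
    df_out.show()
    #+----------+----+------+
    #|      Date|name|output|
    #+----------+----+------+
    #|12/12/2020|   a|     Y|
    #|24/01/2019|   b|     Y|
    #|12/24/2020|   d|    No|
    #|       nan|   e|    No|
    #+----------+----+------+

    On Spark 3's default settings, strings that don't match the pattern come back as null (consistent with the filtered output above); older Spark versions used a more lenient parser that may coerce out-of-range values like 12/24/2020 instead of rejecting them.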