Tags: apache-spark, date, validation, pyspark, typeerror

Validate date column for PySpark dataframe


I want to validate a date column in a PySpark DataFrame. I know how to do it for pandas, but I can't make it work for PySpark.

import pandas as pd
from datetime import datetime

data = [['Alex', 10, '2001-01-12'], ['Bob', 12, '2005-10-21'], ['Clarke', 13, '2003-12-41']]
df = pd.DataFrame(data, columns=['Name', 'Sale_qty', 'DOB'])
sparkDF = spark.createDataFrame(df)


def validate(date_text):
    try:
        if date_text != datetime.strptime(date_text, "%Y-%m-%d").strftime('%Y-%m-%d'):
            raise ValueError
        return True
    except ValueError:
        return False

df = df['DOB'].apply(validate)

print(df)

It works for the pandas DataFrame, but I can't make it work for PySpark. I get the following error:

sparkDF = sparkDF['DOB'].apply(lambda x: validate(x))
TypeError                                 Traceback (most recent call last)
<ipython-input-83-5f5f1db1c7b3> in <module>
----> 1 sparkDF = sparkDF['DOB'].apply(lambda x: validate(x))

TypeError: 'Column' object is not callable
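The error occurs because `sparkDF['DOB']` is a Spark `Column` expression, not a pandas Series, and `Column` has no `.apply` method. If you want to keep the row-wise Python check, the plain-Python part can be written as below (a sketch; the `2001-1-12` behavior shown in the comment follows from the round-trip comparison, since `strptime` accepts unpadded fields but `strftime` re-pads them):

```python
from datetime import datetime

def validate(date_text):
    # Strict check: the string must parse as %Y-%m-%d AND survive a
    # round trip through strftime unchanged. This rejects out-of-range
    # values like '2003-12-41' (strptime raises ValueError) and
    # unpadded variants like '2001-1-12' (round trip yields '2001-01-12').
    try:
        return date_text == datetime.strptime(date_text, "%Y-%m-%d").strftime("%Y-%m-%d")
    except (ValueError, TypeError):
        return False
```

To run it per row in Spark you could wrap it in a UDF, e.g. `F.udf(validate, BooleanType())` and then `sparkDF.withColumn('valid', validate_udf('DOB'))`; that said, a built-in column expression (as in the solution) avoids UDF serialization overhead.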

Solution

  • You could use the following column expression — `to_date` yields null for strings that cannot be parsed with the given pattern, so `isNotNull()` marks exactly the valid dates:

    F.to_date('DOB', 'yyyy-M-d').isNotNull()
    

    Full test:

    from pyspark.sql import functions as F
    
    data = [['Alex', 10, '2001-01-12'], ['Bob', 12, '2005'], ['Clarke', 13, '2003-12-41']]
    df = spark.createDataFrame(data, ['Name', 'Sale_qty', 'DOB'])
    
    validation = F.to_date('DOB', 'yyyy-M-d').isNotNull()
    
    df.withColumn('validation', validation).show()
    # +------+--------+----------+----------+
    # |  Name|Sale_qty|       DOB|validation|
    # +------+--------+----------+----------+
    # |  Alex|      10|2001-01-12|      true|
    # |   Bob|      12|      2005|     false|
    # |Clarke|      13|2003-12-41|     false|
    # +------+--------+----------+----------+
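    The same expression can also be used to split the DataFrame into valid and invalid rows. A sketch using the data above (the `timeParserPolicy` setting is an assumption for Spark 3+, where unparseable strings should return null rather than raise a legacy-upgrade exception):

    ```python
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[1]").appName("date-validation").getOrCreate()
    # Assumption: on Spark 3+, use the non-legacy parser so bad strings become null.
    spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")

    data = [['Alex', 10, '2001-01-12'], ['Bob', 12, '2005'], ['Clarke', 13, '2003-12-41']]
    df = spark.createDataFrame(data, ['Name', 'Sale_qty', 'DOB'])

    valid = F.to_date('DOB', 'yyyy-M-d').isNotNull()

    clean = df.filter(valid)    # rows with parseable dates
    bad = df.filter(~valid)     # rows needing correction
    ```

    Filtering with the expression keeps everything inside Spark's optimizer, unlike a Python UDF.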