I have a DataFrame with a date column containing several different date formats. I want to validate and extract only the records in the 'MM-dd-yy' format, but my filter also returns records in the 'MM-dd-yyyy' format. What is the correct way to extract only the 'MM-dd-yy' records?
from pyspark.sql import functions as F

df = sc.parallelize([['12-21-20'],
                     ['05-30-2020'],
                     ['01-01-1984'],
                     ['12-24-20']]).toDF(["Date"])
df.show()
+----------+
| Date|
+----------+
| 12-21-20|
|05-30-2020|
|01-01-1984|
| 12-24-20|
+----------+
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
df = df.filter(F.col("Date").isNull() |
               F.to_date(F.col("Date"), 'MM-dd-yy').isNotNull())
df.show()
+----------+
| Date|
+----------+
| 12-21-20|
|05-30-2020|
|01-01-1984|
| 12-24-20|
+----------+
Expected output:
+----------+
| Date|
+----------+
| 12-21-20|
| 12-24-20|
+----------+
The reason `to_date` alone does not filter these out is that with the LEGACY parser policy, Spark uses Java's `SimpleDateFormat`, which interprets a numeric year that is not exactly two digits literally: parsing '2020' with pattern 'yy' simply yields the year 2020 instead of failing. I found a simple way around this by combining `to_date` with `regexp_extract`, which enforces exactly two year digits:
df = df.filter((F.regexp_extract('Date', r'(\d{2}-\d{2}-\d{2})$', 1) != '') &
               (F.to_date('Date', 'MM-dd-yy').isNotNull()))
df.show()
+--------+
| Date|
+--------+
|12-21-20|
|12-24-20|
+--------+
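The same two-step idea (check the shape with a regex, then confirm it is a real date by parsing) can be sketched in plain Python, which makes the pattern easy to unit-test outside Spark. The helper name `is_mm_dd_yy` is just illustrative:

```python
import re
from datetime import datetime

def is_mm_dd_yy(s: str) -> bool:
    """Return True only for strings shaped like MM-dd-yy that are valid dates."""
    # Require exactly two digits per component, anchored at both ends,
    # so 'MM-dd-yyyy' strings are rejected up front.
    if not re.fullmatch(r"\d{2}-\d{2}-\d{2}", s):
        return False
    # %y accepts only two-digit years; strptime also rejects
    # impossible dates such as month 13 or day 45.
    try:
        datetime.strptime(s, "%m-%d-%y")
        return True
    except ValueError:
        return False

dates = ["12-21-20", "05-30-2020", "01-01-1984", "12-24-20", "13-45-99"]
print([d for d in dates if is_mm_dd_yy(d)])  # ['12-21-20', '12-24-20']
```

Note that unlike the lenient legacy Spark parser, `strptime` alone would already reject the four-digit years here; the regex is still useful as an explicit, self-documenting shape check.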