Using Spark 3.1, I am trying to convert string values in "MM/dd/yyyy" format into "dd-MM-yyyy" format. My file has 5 date columns in total. Some values in these columns are already in the target format, like 05-02-2022 ("dd-MM-yyyy"), while others look like 10/23/2021 ("MM/dd/yyyy"). I want to convert only the values that are in "MM/dd/yyyy" format to "dd-MM-yyyy". How can I achieve this?
Input:
df = pd.DataFrame([['10/23/2019', '09/13/2021'], ['06/16/2020', '03/16/2021'], ['09/06/2022', '12/23/2019']], columns=['A', 'B'])
The expected output looks like 23-10-2019, 13-09-2021
My code:
df = df.withColumn('date_col', to_date('Date_col', 'dd-MM-yy'))
The code runs without errors, but the date column comes back as undefined in the output. Since I have 5 date columns, is it possible to do this with a for loop?
Example input df (all the columns are of string type):
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('x', '01/31/2021', '12/31/2020'),
     ('y', '1/1/2020', '1/18/2020'),
     ('z', '15-01-2020', '20-10-2020')],
    ['id', 'c1', 'c2'])
The following will convert every column into date type:
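for c in ['c1', 'c2']:
    df = df.withColumn(
        c,
        # Raw strings keep the regex backslashes intact.
        # Each branch parses only the values that match its pattern.
        F.when(F.col(c).rlike(r'\d{1,2}/\d{1,2}/\d{4}'), F.to_date(c, 'M/d/yyyy'))
         .when(F.col(c).rlike(r'\d{1,2}-\d{1,2}-\d{4}'), F.to_date(c, 'd-M-yyyy'))
    )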
df.show()
# +---+----------+----------+
# | id|        c1|        c2|
# +---+----------+----------+
# |  x|2021-01-31|2020-12-31|
# |  y|2020-01-01|2020-01-18|
# |  z|2020-01-15|2020-10-20|
# +---+----------+----------+
The rlike method checks whether the string matches the given regex pattern, and if it does, the when condition applies the corresponding function.
\d{1,2}/\d{1,2}/\d{4} checks the format M/d/yyyy
\d{1,2}-\d{1,2}-\d{4} checks the format d-M-yyyy
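To see which strings each pattern accepts without starting Spark, the check can be approximated in plain Python. This is only a sketch: rlike uses Java regular expressions and matches anywhere in the string, which re.search imitates closely for these simple patterns. The classify helper and its anchored flag are made up for illustration and are not part of any Spark API.

```python
import re

# The two patterns from the answer. rlike looks for a match anywhere in the
# string, so re.search (not re.fullmatch) is the closer analogue.
SLASH = r"\d{1,2}/\d{1,2}/\d{4}"   # intended for M/d/yyyy
DASH = r"\d{1,2}-\d{1,2}-\d{4}"    # intended for d-M-yyyy


def classify(value, anchored=False):
    """Return the datetime pattern the value would be parsed with, or None.

    Hypothetical helper: with anchored=True the regex must cover the whole
    string, which rejects values with extra text around the date.
    """
    for regex, fmt in ((SLASH, "M/d/yyyy"), (DASH, "d-M-yyyy")):
        pattern = f"^{regex}$" if anchored else regex
        if re.search(pattern, value):
            return fmt
    return None


print(classify("1/18/2020"))                       # M/d/yyyy
print(classify("15-01-2020"))                      # d-M-yyyy
print(classify("2020-01-15"))                      # None
print(classify("id 01/31/2021 x"))                 # M/d/yyyy
print(classify("id 01/31/2021 x", anchored=True))  # None
```

The fourth call shows that an unanchored pattern also accepts strings with extra text around the date. If the columns may contain such values, wrapping the patterns in ^ and $ in the rlike calls is one way to make the check stricter.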
If a value is in some other format, the result for it will be null. To avoid nulls, add a matching when branch for that format too.
Spark's date type has no format of its own other than yyyy-MM-dd. If you need another format, you have to convert the date type back into string type, using the format you need:
for c in ['c1', 'c2']:
    df = df.withColumn(
        c,
        # Parse to a date first, then render it back as a dd-MM-yyyy string.
        F.date_format(
            F.when(F.col(c).rlike(r'\d{1,2}/\d{1,2}/\d{4}'), F.to_date(c, 'M/d/yyyy'))
             .when(F.col(c).rlike(r'\d{1,2}-\d{1,2}-\d{4}'), F.to_date(c, 'd-M-yyyy')),
            "dd-MM-yyyy"
        )
    )
df.show()
# +---+----------+----------+
# | id|        c1|        c2|
# +---+----------+----------+
# |  x|31-01-2021|31-12-2020|
# |  y|01-01-2020|18-01-2020|
# |  z|15-01-2020|20-10-2020|
# +---+----------+----------+