I am trying to derive DOB in the date datatype format (YYYY-MM-DD) in pyspark from a column that has the info reversed, as a large integer, and with single digit day/month's leading zeroes removed. This last point means the data varies between 6-8 digits in length. There are two circumstances for 7 digit numbers where deriving isn't possible due to ambiguity, I'm happy to output null values for these cases.
Examples:
Ambiguous examples:
It's quite complicated logic to code which is a bit above me. I'm thinking I could derive new year, month and day columns before deriving the DOB. First derive day and year cols whilst simultaneously dropping these digits from the merged digits col to leave the month.
Year = last 4 digits.
Day:
Before:
merged digits | Day | Year
1232000 | |
After:
merged digits | Day | Year
3 | 12 | 2000
Just an idea. Thanks for your help and ideas!
Just following your logic :
from pyspark.sql import functions as F
F.when(
F.size("merged_digits") == 6,
F.Array(
F.lpad(F.substring("merged_digits", 1, 1), 2, "0"),
F.lpad(F.substring("merged_digits", 2, 1), 2, "0"),
F.substring("merged_digits", 3, 4),
),
).when(
F.size("merged_digits") == 8,
F.Array(
F.substring("merged_digits", 1, 2),
F.substring("merged_digits", 3, 2),
F.substring("merged_digits", 5, 4),
),
).when(
F.substring("merged_digits", 1, 1) == "0",
F.Array(
F.substring("merged_digits", 1, 2),
F.lpad(F.substring("merged_digits", 3, 1), 2, "0"),
F.substring("merged_digits", 4, 4),
),
).when(
F.substring("merged_digits", 2, 1).cast("int") >= 2,
F.Array(
F.substring("merged_digits", 1, 2),
F.lpad(F.substring("merged_digits", 3, 1), 2, "0"),
F.substring("merged_digits", 4, 4),
),
).when(
(F.substring("merged_digits", 2, 1) == "1")
& (F.substring("merged_digits", 3, 1) == "0"),
F.Array(
F.lpad(F.substring("merged_digits", 1, 1), 2, "0"),
F.substring("merged_digits", 2, 2),
F.substring("merged_digits", 4, 4),
),
)
The output I created is an array with first element as day, second as month, and last as year, all left-padded with 0 to have the same format.