Search code examples
pythondatepysparkstring-to-datetime

Derive dob / date of birth from merged numbers in pyspark or python?


I am trying to derive DOB in the date datatype format (YYYY-MM-DD) in pyspark from a column that has the info reversed, as a large integer, and with single digit day/month's leading zeroes removed. This last point means the data varies between 6-8 digits in length. There are two circumstances for 7 digit numbers where deriving isn't possible due to ambiguity, I'm happy to output null values for these cases.

Examples:

  • 831945 = 8th March 1945
  • 1232000 = 12th March 2000
  • 11102000 = 11th Oct 2000

Ambiguous examples:

  • 111YYYY = 1st Nov or 11th Jan
  • 112YYYY = 1st Dec or 11th Feb

It's quite complicated logic to code which is a bit above me. I'm thinking I could derive new year, month and day columns before deriving the DOB. First derive day and year cols whilst simultaneously dropping these digits from the merged digits col to leave the month.

Year = last 4 digits.

Day:

  • if len 6 then day = first digit
  • if len 7 and 2nd digit = 0 or >=2 then day = first two digits
  • if len 7 and 2nd digit = 1 and 3rd digit ==0 then day = first digit
  • if len 7 and 2nd digit = 1 and 3rd digit ==1,2 then output null
  • if len 7 and 2nd digit = 1 and 3rd digit >=3 then day = first two digits
  • if len 7 and 2nd digit =>2 then day = first two digits
  • if len 8 then day = first two digits

Before:

merged digits  | Day | Year
1232000        |     | 

After:

merged digits  | Day | Year
3              | 12  | 2000

Just an idea. Thanks for your help and ideas!


Solution

  • Just following your logic :

    from pyspark.sql import functions as F
    
    
    F.when(
        F.size("merged_digits") == 6,
        F.Array(
            F.lpad(F.substring("merged_digits", 1, 1), 2, "0"),
            F.lpad(F.substring("merged_digits", 2, 1), 2, "0"),
            F.substring("merged_digits", 3, 4),
        ),
    ).when(
        F.size("merged_digits") == 8,
        F.Array(
            F.substring("merged_digits", 1, 2),
            F.substring("merged_digits", 3, 2),
            F.substring("merged_digits", 5, 4),
        ),
    ).when(
        F.substring("merged_digits", 1, 1) == "0",
        F.Array(
            F.substring("merged_digits", 1, 2),
            F.lpad(F.substring("merged_digits", 3, 1), 2, "0"),
            F.substring("merged_digits", 4, 4),
        ),
    ).when(
        F.substring("merged_digits", 2, 1).cast("int") >= 2,
        F.Array(
            F.substring("merged_digits", 1, 2),
            F.lpad(F.substring("merged_digits", 3, 1), 2, "0"),
            F.substring("merged_digits", 4, 4),
        ),
    ).when(
        (F.substring("merged_digits", 2, 1) == "1")
        & (F.substring("merged_digits", 3, 1) == "0"),
        F.Array(
            F.lpad(F.substring("merged_digits", 1, 1), 2, "0"),
            F.substring("merged_digits", 2, 2),
            F.substring("merged_digits", 4, 4),
        ),
    )
    

    The output I created is an array with first element as day, second as month, and last as year, all left-padded with 0 to have the same format.