
Changing digits in numbers based on conditions


In Norway we have something called D- and S-numbers. These are national identification numbers in which the day or month of birth is modified.

D-number
[d+4]dmmyy

S-number
dd[m+5]myy
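
To make the two rules concrete, here is a minimal sketch that undoes the offsets on a single value. The helper name `decode_birth_date` is hypothetical and only for illustration; it assumes the input is a string of up to six digits in ddmmyy order.

```python
def decode_birth_date(raw: str) -> str:
    """Return ddmmyy with the D-/S-number offsets removed (hypothetical helper)."""
    digits = raw.zfill(6)  # restore a missing leading zero
    day, month, year = int(digits[:2]), int(digits[2:4]), digits[4:]
    if day > 31:      # D-number: the first day digit was raised by 4
        day -= 40
    if month > 12:    # S-number: the first month digit was raised by 5
        month -= 50
    return f"{day:02d}{month:02d}{year}"


print(decode_birth_date("710586"))  # 310586
print(decode_birth_date("105299"))  # 100299
```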

I have a column with dates, some of them normal (ddmmyy) and some formatted as D- or S-numbers. Leading zeroes are also missing.

import pandas as pd

df = pd.DataFrame({'dates': [241290,  #24.12.90
                             710586,  #31.05.86
                             105299,  #10.02.99
                              56187]  #05.11.87
                  })

    dates
0  241290
1  710586
2  105299
3   56187

I've written this function to add the leading zero and convert the dates, but the solution doesn't feel that great.

def func(s):
    s = s.astype(str)
    res = []
    for index, value in s.items():
        
        # Make sure all dates have 6 digits (add leading zero)
        if len(value) == 5:
            value = ('0' + value)
        
        # Convert S- and D-dates to regular dates
        if int(value[0]) > 3:
            
            # subtract 4 from the first digit
            res.append(str(int(value[0]) - 4) + value[1:])
        
        elif int(value[2]) > 1:
            # subtract 5 from the third digit
            res.append(value[:2] + str(int(value[2]) - 5) + value[3:])
        
        else:
            res.append(value)
            
    # Keep the original index so the result aligns with the input Series
    return pd.Series(res, index=s.index)

Is there a smoother and faster way of accomplishing the same result?


Solution

  • Normalize the dates by padding with zeros, then split them into three two-digit columns (day, month, year). Apply your rules and combine the columns into datetime values with pd.to_datetime:

    # Suggested by @HenryEcker
    # Changed: .pad(6, fillchar='0')  to  .zfill(6)
    dates = df['dates'].astype(str).str.zfill(6).str.findall(r'(\d{2})') \
                       .apply(pd.Series).astype(int) \
                       .rename(columns={0: 'day', 1: 'month', 2: 'year'}) \
                       .agg({'day': lambda d: d if d <= 31 else d - 40,
                             'month': lambda m: m if m <= 12 else m - 50,
                             'year': lambda y: 1900 + y})
    
    df['dates2'] = pd.to_datetime(dates)
    

    Output:

    >>> df
        dates     dates2
    0  241290 1990-12-24
    1  710586 1986-05-31
    2  105299 1999-02-10
    3   56187 1987-11-05
    
    >>> dates
       day  month  year
    0   24     12  1990
    1   31      5  1986
    2   10      2  1999
    3    5     11  1987
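
Since the column already holds integers, the same rules can also be applied without any string handling. The following is only a sketch of that alternative, not part of the answer above: it splits each value with integer division and modulo, and it carries over the assumption that every year belongs to the 1900s.

```python
import pandas as pd

df = pd.DataFrame({'dates': [241290, 710586, 105299, 56187]})

# Split ddmmyy with integer arithmetic; a missing leading zero does not matter here.
day = df['dates'] // 10_000
month = df['dates'] // 100 % 100
year = df['dates'] % 100

# Undo the offsets: D-numbers add 40 to the day, S-numbers add 50 to the month.
day = day.where(day <= 31, day - 40)
month = month.where(month <= 12, month - 50)

# Assumption: all years are 19yy, as in the answer above.
parts = pd.DataFrame({'year': 1900 + year, 'month': month, 'day': day})
df['dates2'] = pd.to_datetime(parts)

print(df)
```

Because every step works on whole columns, this avoids the per-row Python calls in the loop from the question.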