Search code examples
pythonpandasdataframepython-datetimedateparser

How to convert Pandas column into date type when values don't respect a pattern?


I have the fallowing dataFrame:

    Timestamp           real time
0   17FEB20:23:59:50    0.003
1   17FEB20:23:59:55    0.003
2   17FEB20:23:59:57    0.012
3   17FEB20:23:59:57    02:54.8
4   17FEB20:24:00:00    0.03
5   18FEB20:00:00:00    0
6   18FEB20:00:00:02    54.211
7   18FEB20:00:00:02    0.051

How to convert the columns to datetime64?

There're 2 things that is making this challengeable form me:

  1. The column Timestamp, index 4 has the value: 17FEB20:24:00:00, which seems not to be a valid date-time (although it was output by a SAS program...).
  2. The column real time don't fallow a pattern and seems it cannot be matched through a date_parser.

This is what I've tried to address the first column (Timestamp):

data['Timestamp'] = pd.to_datetime(
    data['Timestamp'],
    format='%d%b%y:%H:%M:%S')

But due the value of the index 4 (17FEB20:24:00:00) I get: ValueError: time data '17FEB20:24:00:00' does not match format '%d%b%y:%H:%M:%S' (match). If I remove this line, it does work, but I have to find a way to address it, as my dataset have of thousands of lines and I cannot simply ignore them. Perhaps there's a way to convert it to zero hours of the next day?

Here's a snippet code to create the dataFrame sample as above to to gain some time working on the answer (if you need):

data = pd.DataFrame({
    'Timestamp':[
        '17FEB20:23:59:50',
        '17FEB20:23:59:55',
        '17FEB20:23:59:57',
        '17FEB20:23:59:57',
        '17FEB20:24:00:00',
        '18FEB20:00:00:00',
        '18FEB20:00:00:02',
        '18FEB20:00:00:02'],
    'real time': [
        '0.003',
        '0.003',
        '0.012',
        '02:54.8',
        '0.03',
        '0',
        '54.211',
        '0.051',
        ]})

Appreciate your help!


Solution

  • Here's how I addressed it:

    • For the column Timestamp, I've used this reply (Thanks @merit_2 for sharing it in the first comment).
    • For the column real time, I parse using some conditions.

    Here's the code:

    import os
    import pandas as pd
    from datetime import timedelta
    
    # Parsing "real time" column:
    
    ## Apply mask '.000' to the microseconds
    data['real time'] = [sub if len(sub.split('.')) == 1 else sub.split('.')[0]+'.'+'{:<03s}'.format(sub.split('.')[1]) for sub in data['real time'].values]
    
    ## apply mask over all '00:00:00.000'
    placeholders = {
        1: '00:00:00.00',
        2: '00:00:00.0',
        3: '00:00:00.',
        4: '00:00:00',
        5: '00:00:0',
        6: '00:00:',
        7: '00:00',
        8: '00:0',
        9: '00:',
        10:'00',
        11:'0'}
    
    for cond_len in placeholders:
        condition = data['real time'].str.len() == cond_len
        data.loc[(condition),'real time'] = placeholders[cond_len] + data.loc[(condition),'real time']
    
    # Parsing "Timestamp" column:
    selrow = data['Timestamp'].str.contains('24:00')
    data['Timestamp'] = data['Timestamp'].str.replace('24:00', '00:00')
    data['Timestamp'] = pd.to_datetime(data['Timestamp'], format='%d%b%y:%H:%M:%S')
    data['Timestamp'] = data['Timestamp'] + selrow * timedelta(days=1)
    
    # Convert to columns to datetime type:
    data['Timestamp'] = pd.to_datetime(data['Timestamp'], format='%d%b%y:%H:%M:%S')
    data['real time'] = pd.to_datetime(data['real time'], format='%H:%M:%S.%f')
    
    # check results:
    display(data)
    display(data.dtypes)
    

    Here's the output:

        Timestamp           real time
    0   2020-02-17 23:59:50 1900-01-01 00:00:00.003
    1   2020-02-17 23:59:55 1900-01-01 00:00:00.003
    2   2020-02-17 23:59:57 1900-01-01 00:00:00.012
    3   2020-02-17 23:59:57 1900-01-01 00:02:54.800
    4   2020-02-18 00:00:00 1900-01-01 00:00:00.030
    5   2020-02-18 00:00:00 1900-01-01 00:00:00.000
    6   2020-02-18 00:00:02 1900-01-01 00:00:54.211
    7   2020-02-18 00:00:02 1900-01-01 00:00:00.051
    
    Timestamp    datetime64[ns]
    real time    datetime64[ns]
    

    Perhaps there's a clever way to do that, but for now it suits.