I have the fallowing dataFrame:
Timestamp real time
0 17FEB20:23:59:50 0.003
1 17FEB20:23:59:55 0.003
2 17FEB20:23:59:57 0.012
3 17FEB20:23:59:57 02:54.8
4 17FEB20:24:00:00 0.03
5 18FEB20:00:00:00 0
6 18FEB20:00:00:02 54.211
7 18FEB20:00:00:02 0.051
How to convert the columns to datetime64
?
There're 2 things that is making this challengeable form me:
Timestamp
, index 4
has the value: 17FEB20:24:00:00
, which seems not to be a valid date-time (although it was output by a SAS program...).real time
don't fallow a pattern and seems it cannot be matched through a date_parser
.This is what I've tried to address the first column (Timestamp
):
data['Timestamp'] = pd.to_datetime(
data['Timestamp'],
format='%d%b%y:%H:%M:%S')
But due the value of the index 4 (17FEB20:24:00:00
) I get:
ValueError: time data '17FEB20:24:00:00' does not match format '%d%b%y:%H:%M:%S' (match)
. If I remove this line, it does work, but I have to find a way to address it, as my dataset have of thousands of lines and I cannot simply ignore them. Perhaps there's a way to convert it to zero hours of the next day?
Here's a snippet code to create the dataFrame sample as above to to gain some time working on the answer (if you need):
data = pd.DataFrame({
'Timestamp':[
'17FEB20:23:59:50',
'17FEB20:23:59:55',
'17FEB20:23:59:57',
'17FEB20:23:59:57',
'17FEB20:24:00:00',
'18FEB20:00:00:00',
'18FEB20:00:00:02',
'18FEB20:00:00:02'],
'real time': [
'0.003',
'0.003',
'0.012',
'02:54.8',
'0.03',
'0',
'54.211',
'0.051',
]})
Appreciate your help!
Here's how I addressed it:
Timestamp
, I've used this reply (Thanks @merit_2 for sharing it in the first comment).real time
, I parse using some conditions.Here's the code:
import os
import pandas as pd
from datetime import timedelta
# Parsing "real time" column:
## Apply mask '.000' to the microseconds
data['real time'] = [sub if len(sub.split('.')) == 1 else sub.split('.')[0]+'.'+'{:<03s}'.format(sub.split('.')[1]) for sub in data['real time'].values]
## apply mask over all '00:00:00.000'
placeholders = {
1: '00:00:00.00',
2: '00:00:00.0',
3: '00:00:00.',
4: '00:00:00',
5: '00:00:0',
6: '00:00:',
7: '00:00',
8: '00:0',
9: '00:',
10:'00',
11:'0'}
for cond_len in placeholders:
condition = data['real time'].str.len() == cond_len
data.loc[(condition),'real time'] = placeholders[cond_len] + data.loc[(condition),'real time']
# Parsing "Timestamp" column:
selrow = data['Timestamp'].str.contains('24:00')
data['Timestamp'] = data['Timestamp'].str.replace('24:00', '00:00')
data['Timestamp'] = pd.to_datetime(data['Timestamp'], format='%d%b%y:%H:%M:%S')
data['Timestamp'] = data['Timestamp'] + selrow * timedelta(days=1)
# Convert to columns to datetime type:
data['Timestamp'] = pd.to_datetime(data['Timestamp'], format='%d%b%y:%H:%M:%S')
data['real time'] = pd.to_datetime(data['real time'], format='%H:%M:%S.%f')
# check results:
display(data)
display(data.dtypes)
Here's the output:
Timestamp real time
0 2020-02-17 23:59:50 1900-01-01 00:00:00.003
1 2020-02-17 23:59:55 1900-01-01 00:00:00.003
2 2020-02-17 23:59:57 1900-01-01 00:00:00.012
3 2020-02-17 23:59:57 1900-01-01 00:02:54.800
4 2020-02-18 00:00:00 1900-01-01 00:00:00.030
5 2020-02-18 00:00:00 1900-01-01 00:00:00.000
6 2020-02-18 00:00:02 1900-01-01 00:00:54.211
7 2020-02-18 00:00:02 1900-01-01 00:00:00.051
Timestamp datetime64[ns]
real time datetime64[ns]
Perhaps there's a clever way to do that, but for now it suits.