I have a fixed-width (length-delimited) source file. It looks like this:
00;12345678;03122019
01;12345678;00000000
My code so far is as follows:
import pandas as pd
from datetime import datetime
col_lengths = {'Column1': range(0, 2),
               'Column2': range(3, 11),
               'Datecolumn': range(12, 20),
               }
col_lengths = {k: set(v) for k, v in col_lengths.items()}
pd.read_fwf(r'D:\test.txt', colspecs=[(min(x), max(x)+1) for x in col_lengths.values()], header=None, names=col_lengths.keys(),
            converters={'Column1': lambda x: str(x),
                        'Column2': lambda x: str(x),
                        'Datecolumn': lambda s: datetime.strptime(s, '%d%m%Y'),
                        }
            )
The problem is that the second line of my file raises an error:
ValueError: time data '00000000' does not match format '%d%m%Y'
What is a good way to handle this? I have no experience with it so far. The only alternative I can think of is to read the column as a string and convert it later in the pandas DataFrame. But if there is an elegant way to handle it directly while reading, I would like to use it. How could I handle it on the fly, and would that be a good idea?
You could use pandas.to_datetime as the converter. With errors='coerce', values that do not match the format become NaT instead of raising:
#converters={...
'Datecolumn': lambda s: pd.to_datetime(s, format='%d%m%Y', errors='coerce'),
# }
Output:
Column1 Column2 Datecolumn
0 00 12345678 2019-12-03
1 01 12345678 NaT
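For reference, here is a minimal self-contained sketch of the converter approach. It assumes the two sample lines from the question and reads them from an in-memory buffer instead of D:\test.txt; the colspecs are written out directly rather than derived from ranges, and the variable names are only illustrative.

```python
import io
import pandas as pd

# Sample data from the question, held in memory instead of D:\test.txt
# (an assumption so the snippet runs anywhere).
data = "00;12345678;03122019\n01;12345678;00000000\n"

# Half-open (start, end) positions; the ';' separators are skipped.
colspecs = [(0, 2), (3, 11), (12, 20)]
names = ["Column1", "Column2", "Datecolumn"]

df = pd.read_fwf(
    io.StringIO(data),
    colspecs=colspecs,
    header=None,
    names=names,
    converters={
        "Column1": str,  # keep leading zeros
        "Column2": str,
        # Unparseable values such as '00000000' become NaT instead of raising.
        "Datecolumn": lambda s: pd.to_datetime(s, format="%d%m%Y", errors="coerce"),
    },
)
print(df)
```

Note that a converter is called once per cell, so this is convenient but may be slower on large files than converting the whole column in one call.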
Alternatively, read the column as str and convert afterwards:
df['Datecolumn'] = pd.to_datetime(df['Datecolumn'], format='%d%m%Y', errors='coerce')
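A complete sketch of this second variant might look as follows. It again assumes the question's sample data in an in-memory buffer, and uses dtype=str to read every column as text (one possible way to get strings; the str converters from the question would work as well).

```python
import io
import pandas as pd

# Sample data from the question, held in memory (assumption for a runnable demo).
data = "00;12345678;03122019\n01;12345678;00000000\n"

df = pd.read_fwf(
    io.StringIO(data),
    colspecs=[(0, 2), (3, 11), (12, 20)],
    header=None,
    names=["Column1", "Column2", "Datecolumn"],
    dtype=str,  # read everything as text, so leading zeros survive
)

# Convert the whole column in one call; invalid dates become NaT.
df["Datecolumn"] = pd.to_datetime(df["Datecolumn"], format="%d%m%Y", errors="coerce")
print(df)
```

Converting after reading handles the column in a single call rather than once per cell, and it keeps the parsing step separate from the cleaning step, which can make the invalid rows easier to inspect before they are coerced.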