I'm reading CSV data into a pandas dataframe, but the CSV doesn't appear to follow any sane convention (besides using ; as delimiter, as everyone should...). It seems the sole goal was to make the files look good when opened in a text editor.
Here are some examples (set up as variables so they can be used in the reader examples):
ex1="""
Type: Some Metadata
User: More Metadata
Data:
01.10.1939 00:00:00 ; 1,1 ;
01.12.1939 00:00:00 ; 1 ;
01.01.1940 00:00:00 ; 10 ;
"""
OK, decimal comma (easy), semicolon delimiter (easy), dayfirst (easy) and a bunch of metadata (skiprows, also easy).
ts = pd.read_csv(io.StringIO(ex1), skiprows=4, decimal=',', sep=';',
                 index_col=0, usecols=[0,1], dayfirst=True,
                 parse_dates=True, names=['date', 'val'])
print(ts)
results in a nice dataframe
             val
date
1939-10-01   1.1
1939-12-01   1.0
1940-01-01  10.0
and ts.index is a nice DatetimeIndex, and type(ts.val[0]) is numpy.float64, as it should be. But let's introduce a creative way to mark NaN:
ex2="""
Type: Some Metadata
User: More Metadata
Data:
01.10.1939 00:00:00 ; 1,1;
01.12.1939 00:00:00 ; NÄ ;
01.01.1940 00:00:00 ; 10 ;
"""
The above ts = pd.read_csv(...) call still works without error, but NÄ breaks the val column and turns it into strings. But when I change this to
ts = pd.read_csv(io.StringIO(ex2), skiprows=4, decimal=',', sep=';',
                 index_col=0, usecols=[0,1], dayfirst=True,
                 parse_dates=True, names=['date', 'val'],
                 na_values='NÄ')
using na_values, the whole thing fails. print(ts) now shows
             val
date
1939-10-01   1,1
1939-12-01    NÄ
1940-01-01    10
Not only did it not accept NÄ as NaN, it also turned all vals into strings, thus ignoring the decimal comma and keeping the stray spaces. ts.val[0] is now ' 1,1', so a simple ts.val = ts.val.astype(float) of course fails.
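As an aside, a column that has already degraded to strings can still be rescued after the fact. This is just damage control, not the real fix; a minimal sketch, assuming the values carry stray spaces, decimal commas and the NÄ marker:

```python
import io
import pandas as pd

ex2 = """
Type: Some Metadata
User: More Metadata
Data:
01.10.1939 00:00:00 ; 1,1;
01.12.1939 00:00:00 ; NÄ ;
01.01.1940 00:00:00 ; 10 ;
"""

# Read naively; 'val' comes out as strings like ' 1,1' and ' NÄ '
ts = pd.read_csv(io.StringIO(ex2), skiprows=4, sep=';',
                 index_col=0, usecols=[0, 1], dayfirst=True,
                 parse_dates=True, names=['date', 'val'])

# Strip the spaces, swap the decimal comma for a point, and let
# to_numeric coerce anything unparseable (like 'NÄ') to NaN
ts['val'] = pd.to_numeric(
    ts['val'].str.strip().str.replace(',', '.', regex=False),
    errors='coerce')
```

This sidesteps the parser arguments entirely, at the cost of a second pass over the column.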
What am I doing wrong with na_values='NÄ'? Why does it also break decimal=',' and keep the spaces?
It seems like skipinitialspace=True should help, but of course NÄ still breaks the val column.
sep='\s*[;]s*' seems promising, and
ts = pd.read_csv(io.StringIO(ex2), skiprows=4, decimal=',', sep='\s*[;]s*',
                 index_col=0, usecols=[0,1], dayfirst=True,
                 parse_dates=True, names=['date', 'val'],
                 na_values='NÄ')
gives an apparently nice print(ts):
             val
date
1939-10-01   1.1
1939-12-01    NÄ
1940-01-01    10
(note the decimal point!), but now I've got the weird situation that it did replace the comma, yet ts.val[0] is a string again and still has the stray spaces (' 1.1').
So how do I read in those borked files?
The workaround I'm currently using is to read the CSV with pure Python (I have to read the header, 40 lines in the real files, anyway) and write it out as a proper CSV to be read with pandas:
file = open(myfile, 'r', encoding='UTF-8')
table = file.readlines()
file.close()
for v1 in range(0, len(table)):
    table[v1] = table[v1].replace("NÄ", "NaN")
    table[v1] = table[v1].replace(",", ".")
dataoutput = ["date;val\r\n"]
for v1 in range(3, len(table)):
    dataoutput.append(table[v1])
f2 = open(myfile.replace('.csv', 'good.csv'), 'w')
for v1 in range(0, len(dataoutput)):
    f2.write(dataoutput[v1])
f2.close()
ts = pd.read_csv(myfile.replace('.csv', 'good.csv'),
                 sep=';', usecols=[0, 1], index_col=0,
                 dayfirst=True, parse_dates=True)
ts.val = ts.val.astype(float)
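For what it's worth, the same pre-cleaning can be done entirely in memory, skipping the temporary good.csv file. A sketch along the same lines (the clean_and_read name and the header_lines parameter are my own invention, not anything from pandas):

```python
import io
import pandas as pd

ex2 = """
Type: Some Metadata
User: More Metadata
Data:
01.10.1939 00:00:00 ; 1,1;
01.12.1939 00:00:00 ; NÄ ;
01.01.1940 00:00:00 ; 10 ;
"""

def clean_and_read(text, header_lines=3):
    # Same replacements as the file-based workaround, but in memory
    text = text.replace('NÄ', 'NaN').replace(',', '.')
    # Drop the metadata header, keep only the data lines
    body = ''.join(text.splitlines(keepends=True)[header_lines:])
    ts = pd.read_csv(io.StringIO(body), sep=';', usecols=[0, 1],
                     index_col=0, dayfirst=True, parse_dates=True,
                     names=['date', 'val'])
    ts['val'] = ts['val'].astype(float)
    return ts

# ex2 starts with a blank line, hence four header lines here
ts = clean_and_read(ex2, header_lines=4)
```

For real files you'd read the text with open(path, encoding='UTF-8').read() first; this avoids the extra write/read round-trip but is still a two-pass approach.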
But with a few thousand CSV files, reaching up to a megabyte in size, this isn't really an optimal solution, so I'd like to solve the NÄ issue during the import.
Your sep is mis-specified: it ends with s* instead of \s*, which means it's looking for between zero and infinitely many literal s characters. That's why you're only capturing the leading and not the trailing spaces around the ;. Incidentally, this was also interfering with na_values, because you were trying to match 'NÄ' but the actual value was ' NÄ '. Use sep='\s*\;\s*' instead.
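With that one fix, the call from the question should go through directly, NÄ and all. A sketch (I pass engine='python' explicitly, since regex separators fall back to the python parser anyway):

```python
import io
import pandas as pd

ex2 = """
Type: Some Metadata
User: More Metadata
Data:
01.10.1939 00:00:00 ; 1,1;
01.12.1939 00:00:00 ; NÄ ;
01.01.1940 00:00:00 ; 10 ;
"""

# The corrected separator eats the padding on both sides of ';',
# so the remaining token is exactly 'NÄ' and na_values can match it
ts = pd.read_csv(io.StringIO(ex2), skiprows=4, decimal=',',
                 sep=r'\s*;\s*', engine='python',
                 index_col=0, usecols=[0, 1], dayfirst=True,
                 parse_dates=True, names=['date', 'val'],
                 na_values='NÄ')
```

With the padding gone, decimal=',' and na_values both see clean tokens, and val comes out as float64.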
One thing you can do in future is to print out the offending values by themselves to make sure they contain what you think they contain, e.g. ts.iloc[1].val.
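For instance, taking the failed na_values attempt from the question, repr makes the invisible padding obvious (a sketch; ts here is just that broken read repeated):

```python
import io
import pandas as pd

ex2 = """
Type: Some Metadata
User: More Metadata
Data:
01.10.1939 00:00:00 ; 1,1;
01.12.1939 00:00:00 ; NÄ ;
01.01.1940 00:00:00 ; 10 ;
"""

# The failed attempt from the question: plain sep=';' keeps the padding
ts = pd.read_csv(io.StringIO(ex2), skiprows=4, decimal=',', sep=';',
                 index_col=0, usecols=[0, 1], dayfirst=True,
                 parse_dates=True, names=['date', 'val'],
                 na_values='NÄ')

# repr exposes the spaces that defeated na_values
print(repr(ts.iloc[1].val))  # -> ' NÄ '
```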
Also, if the NaN marker being in unicode is a problem, you can strip it before parsing:
csv = io.StringIO(ex2.replace(u'N\xc4', '[MISSING]'))
ts = pd.read_csv(csv,
                 skiprows=4, decimal=',', index_col=0, usecols=[0,1],
                 dayfirst=True, parse_dates=True, names=['date', 'val'],
                 na_values='[MISSING]', sep='\s*\;\s*')
...which would give...
             val
date
1939-10-01   1.1
1939-12-01   NaN
1940-01-01  10.0