I'm trying to set up a Python script that will be able to read in many fixed width data files and then convert them to csv. To do this I'm using pandas like this:
pandas.read_fwf('source.txt', colspecs=column_position_length).\
to_csv('output.csv', header=column_name, index=False, encoding='utf-8')
Where column_position_length
and column_name
are lists containing the information needed to read and write the data.
Within these files I have long strings of numbers representing test answers. For instance: 333133322122222223133313222222221222111133313333
represents the correct answers on a multiple choice test. So this is more of a code than a numeric value. The problem that I am having is pandas interpreting these values as floats and then writing these values in scientific notation into the csv (3.331333221222221e+47).
I found a lot of questions regarding this issue, but they didn't quite resolve my issue.
dtype
is not supported as an argument for read_fwf
in Python.The second option seemes to be the go to answer for reading every column in as a string, but unfortunately it just isn't supported for read_fwf
. Any suggestions?
So I think I figured out a solution, but I don't know why it works. Pandas was interpreting these values as floats because there were NaN values (blank lines) in the columns. By adding keep_default_na=False
to the read_fwf()
parameters, it resolved this issue. According to the documentation:
keep_default_na : bool, default True If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to.
I guess I'm not quite understanding how this is fixing my issue. Could anyone add any clarity on this?