Search code examples
pythonpandasfloating-pointscientific-notationfixed-width

Python Read fixed width files without any data type interpretation using Pandas


I'm trying to set up a Python script that will be able to read in many fixed width data files and then convert them to csv. To do this I'm using pandas like this:

pandas.read_fwf('source.txt', colspecs=column_position_length).\
         to_csv('output.csv', header=column_name, index=False, encoding='utf-8')

Where column_position_length and column_name are lists containing the information needed to read and write the data.

Within these files I have long strings of numbers representing test answers. For instance: 333133322122222223133313222222221222111133313333 represents the correct answers on a multiple choice test. So this is more of a code than a numeric value. The problem that I am having is pandas interpreting these values as floats and then writing these values in scientific notation into the csv (3.331333221222221e+47).

I found a lot of questions regarding this issue, but they didn't quite resolve my issue.

  1. Solution 1 - I believe at this point the values have already been converted to floats so this wouldn't help.
  2. Solution 2 - according to the pandas documentation, dtype is not supported as an argument for read_fwf in Python.
  3. Solution 3 Use converters - the issue with using converters is that you need to specify the column name or index to convert to a data type, but I would like to read all of the columns as strings.

The second option seemes to be the go to answer for reading every column in as a string, but unfortunately it just isn't supported for read_fwf. Any suggestions?


Solution

  • So I think I figured out a solution, but I don't know why it works. Pandas was interpreting these values as floats because there were NaN values (blank lines) in the columns. By adding keep_default_na=False to the read_fwf() parameters, it resolved this issue. According to the documentation:

    keep_default_na : bool, default True If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to.

    I guess I'm not quite understanding how this is fixing my issue. Could anyone add any clarity on this?