Search code examples
pythonpandascsvvarcharfixed-width

Read fixed-width text file with varchar in pandas


I want to read in a text file that has a fixed width format. Unfortunately, it also contains a varchar field that tells me the length in the beginning (so not so fixed width after all). The file looks something like this

Boris     1520190730     0014likes icecreamblue
Lena      1320190815     0009is blondered

with a schema that looks something like this:

{
'name':10,
'age':2,
'last_visit':8,
'other_field':5,
'comment':???,
'fav_color':8
}

Before I came across the varchar field, my approach was to read it in either with pandas' read_fwf or (with a slightly modified schema) via df[col].str[schema[col][0]:schema[col][1]. This fails of course for the variable length field. At least the field tells me its length at the beginning(0014 and 0009).

Is there an elegant pandas way to read in such a file? Or do I have to loop over it line by line and handle the field dynamically?


Solution

  • You could use read_table with a regex delimiter and a converter to read the data, followed by a little postprocessing (splitting a column), for example:

    import pandas
    
    schema = {
        'name': 10,
        'age': 2,
        'last_visit': 8,
        'other_field': 5,
        'comment': None,
        'fav_color': 8
    }
    
    
    # A converter for the variable length and following columns
    def converter(x):
        """Return the comment and the fav_color values separated by ','."""
        length_len = 4
        comment_len = int(x[:length_len])
        return x[length_len:comment_len + length_len:] + ',' + x[comment_len + length_len:]
    
    
    # A regex as delimiter for the fixed length columns
    delimiter = f"(.{{{schema['name']}}})(.{{{schema['age']}}})(.{{{schema['last_visit']}}}).{{{schema['other_field']}}}(.*)"
    # Use the delimiter and converter (column 4 holds comment and fav_color) for reading the table
    data = pandas.read_table('input.txt', header=None, sep=delimiter, converters={4: converter})
    # Clean the table
    data.dropna(inplace=True, axis=1)
    # Split the comment and the fav_color columns
    data[5], data[6] = data[4].str.split(',', 1).str