Search code examples
python-polars

Does Python Polars have a function similar to Pandas with read_fwf ( for reading fixed-width formatted files)


Does Python Polars have a function similar to Pandas read_fwf ( for reading fixed-width formatted files)? Need to read txt files with fixed width data of the type below, and then apply a schema with columns based on positions in the file.

Sample Data: A1G01409xxx xxxx T 30499759910.001099101000199105100105430021 47.570952 -53.32332502xxxxxx WW19921220190000010AAA11 A31124xxxxxx No. 1, Subd. U SNO21499649910.001099101000199105100107410182 47.166651 -52.88563505MOBILE WW19921220190000011NNN51 A3B01339Division No. 1, Subd. G SNO35799759910.001003701000299105100105000122 47.945009 -53.06603613 WW19921220190000011NNN51 A3P01379 T 35799759910.001099101000101364100105250252 47.678897 -53.25694413RIVERHEAD WW19921220190000011NNN52

Illustrative schema: schema_dict = { "Postal_codeOM": (1, 6), "FSA": (7, 3), # ... and so on for other columns ... }


Solution

  • @jqurious is right here there are some solutions but not directly a function read_fwf inside polars.

    You can generate a list of polars expressions with a for loop inside with_columns and make your own purpose read_fwf.

    See here for the first time it appears on GitHub issues:

    https://github.com/pola-rs/polars/issues/3151#issuecomment-1397354684

    quoting @ghuls example (str.strip() becomes str.strip_chars()):

    a sample fwf file:

    $ cat fwf_test.txt
    NAME                STATE     TELEPHONE  
    John Smith          WA        418-Y11-4111
    Mary Hartford       CA        319-Z19-4341
    Evan Nolan          IL        219-532-c301
    

    you have column names, column widths and calculate "slice tuples" for substring full string column.

    df = pl.read_csv(
        "fwf_test.txt",
        has_header=False,
        skip_rows=1,
        new_columns=["full_str"]
    )
    
    column_names = [ "NAME", "STATE", "TELEPHONE" ]
    widths = [20, 10, 12]
    
    
    # Calculate slice values from widths.
    slice_tuples = []
    offset = 0
    
    for i in widths:
        slice_tuples.append((offset, i))
        offset += i
    
    df.with_columns(
        [
           pl.col("full_str").str.slice(slice_tuple[0], slice_tuple[1]).str.strip_chars().alias(col)
           for slice_tuple, col in zip(slice_tuples, column_names)
        ]
    ).drop("full_str")
    
    shape: (3, 3)
    ┌───────────────┬───────┬──────────────┐
    │ NAME          ┆ STATE ┆ TELEPHONE    │
    │ ---           ┆ ---   ┆ ---          │
    │ str           ┆ str   ┆ str          │
    ╞═══════════════╪═══════╪══════════════╡
    │ John Smith    ┆ WA    ┆ 418-Y11-4111 │
    │ Mary Hartford ┆ CA    ┆ 319-Z19-4341 │
    │ Evan Nolan    ┆ IL    ┆ 219-532-c301 │
    └───────────────┴───────┴──────────────┘
    

    Then you can read see @jqurious link which is much complete with types casting, but iteration over zip(slice tuples, names, ...) remains the same logic.

    It's faster than many read_fwf implementations without polars.