Does Python Polars have a function similar to Pandas read_fwf ( for reading fixed-width formatted files)? Need to read txt files with fixed width data of the type below, and then apply a schema with columns based on positions in the file.
Sample Data: A1G01409xxx xxxx T 30499759910.001099101000199105100105430021 47.570952 -53.32332502xxxxxx WW19921220190000010AAA11 A31124xxxxxx No. 1, Subd. U SNO21499649910.001099101000199105100107410182 47.166651 -52.88563505MOBILE WW19921220190000011NNN51 A3B01339Division No. 1, Subd. G SNO35799759910.001003701000299105100105000122 47.945009 -53.06603613 WW19921220190000011NNN51 A3P01379 T 35799759910.001099101000101364100105250252 47.678897 -53.25694413RIVERHEAD WW19921220190000011NNN52
Illustrative schema: schema_dict = { "Postal_codeOM": (1, 6), "FSA": (7, 3), # ... and so on for other columns ... }
@jqurious is right here there are some solutions but not directly a function read_fwf
inside polars.
You can generate a list of polars expressions with a for loop inside with_columns and make your own purpose read_fwf
.
See here for the first time it appears on GitHub issues:
https://github.com/pola-rs/polars/issues/3151#issuecomment-1397354684
quoting @ghuls example (str.strip()
becomes str.strip_chars()
):
a sample fwf file:
$ cat fwf_test.txt
NAME STATE TELEPHONE
John Smith WA 418-Y11-4111
Mary Hartford CA 319-Z19-4341
Evan Nolan IL 219-532-c301
you have column names, column widths and calculate "slice tuples" for substring full string column.
df = pl.read_csv(
"fwf_test.txt",
has_header=False,
skip_rows=1,
new_columns=["full_str"]
)
column_names = [ "NAME", "STATE", "TELEPHONE" ]
widths = [20, 10, 12]
# Calculate slice values from widths.
slice_tuples = []
offset = 0
for i in widths:
slice_tuples.append((offset, i))
offset += i
df.with_columns(
[
pl.col("full_str").str.slice(slice_tuple[0], slice_tuple[1]).str.strip_chars().alias(col)
for slice_tuple, col in zip(slice_tuples, column_names)
]
).drop("full_str")
shape: (3, 3)
┌───────────────┬───────┬──────────────┐
│ NAME ┆ STATE ┆ TELEPHONE │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═══════════════╪═══════╪══════════════╡
│ John Smith ┆ WA ┆ 418-Y11-4111 │
│ Mary Hartford ┆ CA ┆ 319-Z19-4341 │
│ Evan Nolan ┆ IL ┆ 219-532-c301 │
└───────────────┴───────┴──────────────┘
Then you can read see @jqurious link which is much complete with types casting, but iteration over zip(slice tuples, names, ...) remains the same logic.
It's faster than many read_fwf
implementations without polars.