
Truncating a String column of a polars LazyFrame to a certain number of characters


I'm preparing some files for upload to a database using polars's scan_csv.

Some of the columns contain very long text which needs to be truncated to a given number of characters, otherwise the upload will fail.

What is the best way to modify a LazyFrame object produced by scan_csv such that all pl.String columns have at most some specified number of characters?

I know the columns to be modified can be identified by inspecting the schema attribute.

Example data (mock):

import polars as pl

# pretend this comes from scan_csv()
example = pl.LazyFrame(
    {
        "big_column": ["a" * 1000],
        "another_big_column": ["b" * 1000],
        "small_column": ["c"],
        "integer_column": [1]
    },
    schema = {
        "big_column": pl.String,
        "another_big_column": pl.String,
        "small_column": pl.String,
        "integer_column": pl.Int8
    }
)

I'm imagining we can do something like:

for col, dtype in example.schema.items():
    # compare dtypes with ==, since schema entries may be dtype instances
    if dtype == pl.String:
        # do something with col
        pass

Questions:

What should the # do something with col part of the above be to achieve what I want without loading the entire LazyFrame into memory?

Is it even possible to perform such a modification in-place?


Solution

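  • pl.col can also select columns by dtype, so a single expression truncates every pl.String column and the query stays lazy until collect() is called.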

    example.with_columns(pl.col(pl.String).str.slice(0, 10)).collect()
    
    shape: (1, 4)
    ┌────────────┬────────────────────┬──────────────┬────────────────┐
    │ big_column ┆ another_big_column ┆ small_column ┆ integer_column │
    │ ---        ┆ ---                ┆ ---          ┆ ---            │
    │ str        ┆ str                ┆ str          ┆ i8             │
    ╞════════════╪════════════════════╪══════════════╪════════════════╡
    │ aaaaaaaaaa ┆ bbbbbbbbbb         ┆ c            ┆ 1              │
    └────────────┴────────────────────┴──────────────┴────────────────┘