I'm preparing some files for upload to a database using polars' scan_csv.
Some of the columns contain very long text that needs to be truncated to a given number of characters; otherwise the upload fails.
What is the best way to modify a LazyFrame object produced by scan_csv such that all pl.String columns have at most some specified number of characters?
I know the columns to be modified can be identified by inspecting the schema attribute.
Given some mock data:
import polars as pl

# pretend this comes from scan_csv()
example = pl.LazyFrame(
    {
        "big_column": ["a" * 1000],
        "another_big_column": ["b" * 1000],
        "small_column": ["c"],
        "integer_column": [1],
    },
    schema={
        "big_column": pl.String,
        "another_big_column": pl.String,
        "small_column": pl.String,
        "integer_column": pl.Int8,
    },
)
I'm imagining we can do something like:
for col, dtype in example.schema.items():
    if dtype == pl.String:
        # do something with col
        ...
What should the # do something with col part of the above be to achieve what I want without loading the entire LazyFrame into memory?
Is it even possible to perform such a modification in-place?
pl.col can also select by dtype.
example.with_columns(pl.col(pl.String).str.slice(0, 10)).collect()
shape: (1, 4)
┌────────────┬────────────────────┬──────────────┬────────────────┐
│ big_column ┆ another_big_column ┆ small_column ┆ integer_column │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ i8 │
╞════════════╪════════════════════╪══════════════╪════════════════╡
│ aaaaaaaaaa ┆ bbbbbbbbbb ┆ c ┆ 1 │
└────────────┴────────────────────┴──────────────┴────────────────┘
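If you'd rather keep the schema loop from the question, you can build one truncation expression per string column and apply them all in a single with_columns call. A minimal sketch, assuming a limit of 10 characters and a recent polars version where LazyFrame.collect_schema() is available (the older .schema attribute can be used the same way):

# build one truncation expression per String column found in the lazy schema
max_len = 10  # assumed limit for this sketch
truncations = [
    pl.col(name).str.slice(0, max_len)
    for name, dtype in example.collect_schema().items()
    if dtype == pl.String
]
example = example.with_columns(truncations)  # still lazy; nothing is loaded yet
print(example.collect())

On the in-place question: with_columns returns a new LazyFrame rather than modifying the existing one, so rebind the result as above. Either way, the query only runs when you call collect(), so the full data is never materialized just to define the truncation.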