Given a dataframe containing strings, how can I use the second dataframe to grab slices from the first as new columns?
df_strs = pl.DataFrame({
"Col_A": ["AAABBCCCC", "DDDEEFFFF"],
"Col_B": ["AAB", "DDE"]
})
print(df_strs)
df_offsets = pl.DataFrame({
"Reference_Col": ["Col_A", "Col_A", "Col_A", "Col_B", "Col_B"],
"Offset": [0, 3, 5, 0, 2],
"Length": [3, 2, 4, 2, 1],
"Alias": ["Col_A_1", "Col_A_2", "Col_A_3", "Col_B_1", "Col_B_2"]
})
print(df_offsets)
df_strs
shape: (2, 2)
┌───────────┬───────┐
│ Col_A ┆ Col_B │
│ --- ┆ --- │
│ str ┆ str │
╞═══════════╪═══════╡
│ AAABBCCCC ┆ AAB │
│ DDDEEFFFF ┆ DDE │
└───────────┴───────┘
df_offsets
shape: (5, 4)
┌───────────────┬────────┬────────┬─────────┐
│ Reference_Col ┆ Offset ┆ Length ┆ Alias │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ str │
╞═══════════════╪════════╪════════╪═════════╡
│ Col_A ┆ 0 ┆ 3 ┆ Col_A_1 │
│ Col_A ┆ 3 ┆ 2 ┆ Col_A_2 │
│ Col_A ┆ 5 ┆ 4 ┆ Col_A_3 │
│ Col_B ┆ 0 ┆ 2 ┆ Col_B_1 │
│ Col_B ┆ 2 ┆ 1 ┆ Col_B_2 │
└───────────────┴────────┴────────┴─────────┘
So that the resulting dataframe has five columns, where each column is reference[offset:offset+length]
with name defined by alias
?
in the above example, the result would be:
shape: (5, 4)
┌─────────┬─────────┬─────────┬─────────┬─────────┐
│ Col_A_1 ┆ Col_A_2 ┆ Col_A_3 ┆ Col_B_1 ┆ Col_B_2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ str │
╞═════════╪═════════╪═════════╪═════════╪═════════╡
│ AAA ┆ BB ┆ CCCC ┆ AA ┆ B │
│ DDD ┆ EE ┆ FFFF ┆ DD ┆ E │
└─────────┴─────────┴─────────┴─────────┴─────────┘
I can do this using loops, but I feel there must be a more idiomatic way to use polars. My current solution is:
df_offsets = df_offsets.transpose() # this converts our ints to strs
for s in df_offsets:
df_strs = df_strs.with_columns(
pl.col(s[0]).str.slice(int(s[1]), int(s[2])).alias(s[3])
)
>>>print(df_strs)
shape: (2, 7)
┌───────────┬───────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Col_A ┆ Col_B ┆ Col_A_0 ┆ Col_A_1 ┆ Col_A_2 ┆ Col_B_0 ┆ Col_B_1 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞═══════════╪═══════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ AAABBCCCC ┆ AAB ┆ AAA ┆ BB ┆ CCCC ┆ AA ┆ B │
│ DDDEEFFFF ┆ DDE ┆ DDD ┆ EE ┆ FFFF ┆ DD ┆ E │
└───────────┴───────┴─────────┴─────────┴─────────┴─────────┴─────────┘
Is there a more optimized way of doing this?
I can think of two potential optimizations.
Your code is running your expressions, one at a time. One easy optimization is to first collect your expressions into a list, and then run all expressions in parallel using with_columns
.
In addition, let's run this in Lazy mode, which applies further optimizations.
Note that we don't want the list comprehension to be done in Lazy mode -- only the df_strs
. (There is no transpose
method for a LazyFrame.)
(
df_strs
.lazy()
.with_columns(
pl.col(s[0]).str.slice(int(s[1]), int(s[2])).alias(s[3])
for s in df_offsets.transpose()
)
.collect()
)
shape: (2, 7)
┌───────────┬───────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Col_A ┆ Col_B ┆ Col_A_1 ┆ Col_A_2 ┆ Col_A_3 ┆ Col_B_1 ┆ Col_B_2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞═══════════╪═══════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ AAABBCCCC ┆ AAB ┆ AAA ┆ BB ┆ CCCC ┆ AA ┆ B │
│ DDDEEFFFF ┆ DDE ┆ DDD ┆ EE ┆ FFFF ┆ DD ┆ E │
└───────────┴───────┴─────────┴─────────┴─────────┴─────────┴─────────┘
This alone should provide an enormous speed-up.
You can experiment with ways to generate the list of expressions more rapidly. The following may or may not speed up the creation of your expressions (rather than transposing df_offsets
). You can certainly try it.
(
df_strs
.lazy()
.with_columns(
eval(expr) for expr in
df_offsets.select(
pl.format(
r"pl.col('{}').str.slice({}, {}).alias('{}')",
*df_offsets.columns
)
)
.to_series()
)
.collect()
)
The trade-off is that we're generating the expressions inside Polars (as strings) without transposing a DataFrame ... but at the expense of using Python's eval
expression to convert the strings to Polars Expression.