Search code examples
python-polars

Use values from one dataframe to select substring from another


Given a dataframe containing strings, how can I use the second dataframe to grab slices from the first as new columns?

df_strs = pl.DataFrame({
    "Col_A": ["AAABBCCCC", "DDDEEFFFF"],
    "Col_B": ["AAB", "DDE"]
})
print(df_strs)

df_offsets = pl.DataFrame({
    "Reference_Col": ["Col_A", "Col_A", "Col_A", "Col_B", "Col_B"],
    "Offset": [0, 3, 5, 0, 2],
    "Length": [3, 2, 4, 2, 1],
    "Alias": ["Col_A_1", "Col_A_2", "Col_A_3", "Col_B_1", "Col_B_2"]
})
print(df_offsets)

Output:

df_strs
shape: (2, 2)
┌───────────┬───────┐
│ Col_A     ┆ Col_B │
│ ---       ┆ ---   │
│ str       ┆ str   │
╞═══════════╪═══════╡
│ AAABBCCCC ┆ AAB   │
│ DDDEEFFFF ┆ DDE   │
└───────────┴───────┘

df_offsets
shape: (5, 4)
┌───────────────┬────────┬────────┬─────────┐
│ Reference_Col ┆ Offset ┆ Length ┆ Alias   │
│ ---           ┆ ---    ┆ ---    ┆ ---     │
│ str           ┆ i64    ┆ i64    ┆ str     │
╞═══════════════╪════════╪════════╪═════════╡
│ Col_A         ┆ 0      ┆ 3      ┆ Col_A_1 │
│ Col_A         ┆ 3      ┆ 2      ┆ Col_A_2 │
│ Col_A         ┆ 5      ┆ 4      ┆ Col_A_3 │
│ Col_B         ┆ 0      ┆ 2      ┆ Col_B_1 │
│ Col_B         ┆ 2      ┆ 1      ┆ Col_B_2 │
└───────────────┴────────┴────────┴─────────┘

So that the resulting dataframe has five columns, where each column is reference[offset:offset+length] with name defined by alias?

in the above example, the result would be:

shape: (5, 4)
┌─────────┬─────────┬─────────┬─────────┬─────────┐
│ Col_A_1 ┆ Col_A_2 ┆ Col_A_3 ┆ Col_B_1 ┆ Col_B_2 │
│ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
│ str     ┆ str     ┆ str     ┆ str     ┆ str     │
╞═════════╪═════════╪═════════╪═════════╪═════════╡
│ AAA     ┆ BB      ┆ CCCC    ┆ AA      ┆ B       │
│ DDD     ┆ EE      ┆ FFFF    ┆ DD      ┆ E       │
└─────────┴─────────┴─────────┴─────────┴─────────┘

I can do this using loops, but I feel there must be a more idiomatic way to use polars. My current solution is:

df_offsets = df_offsets.transpose() # this converts our ints to strs
for s in df_offsets:
    df_strs = df_strs.with_columns(
        pl.col(s[0]).str.slice(int(s[1]), int(s[2])).alias(s[3])
    )

Output:

>>>print(df_strs)
shape: (2, 7)
┌───────────┬───────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Col_A     ┆ Col_B ┆ Col_A_0 ┆ Col_A_1 ┆ Col_A_2 ┆ Col_B_0 ┆ Col_B_1 │
│ ---       ┆ ---   ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
│ str       ┆ str   ┆ str     ┆ str     ┆ str     ┆ str     ┆ str     │
╞═══════════╪═══════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ AAABBCCCC ┆ AAB   ┆ AAA     ┆ BB      ┆ CCCC    ┆ AA      ┆ B       │
│ DDDEEFFFF ┆ DDE   ┆ DDD     ┆ EE      ┆ FFFF    ┆ DD      ┆ E       │
└───────────┴───────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Is there a more optimized way of doing this?


Solution

  • I can think of two potential optimizations.

    Major: Running your expressions in parallel

    Your code is running your expressions, one at a time. One easy optimization is to first collect your expressions into a list, and then run all expressions in parallel using with_columns.

    In addition, let's run this in Lazy mode, which applies further optimizations.

    Note that we don't want the list comprehension to be done in Lazy mode -- only the df_strs. (There is no transpose method for a LazyFrame.)

    (
        df_strs
        .lazy()
        .with_columns(
            pl.col(s[0]).str.slice(int(s[1]), int(s[2])).alias(s[3])
            for s in df_offsets.transpose()
        )
        .collect()
    )
    
    shape: (2, 7)
    ┌───────────┬───────┬─────────┬─────────┬─────────┬─────────┬─────────┐
    │ Col_A     ┆ Col_B ┆ Col_A_1 ┆ Col_A_2 ┆ Col_A_3 ┆ Col_B_1 ┆ Col_B_2 │
    │ ---       ┆ ---   ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
    │ str       ┆ str   ┆ str     ┆ str     ┆ str     ┆ str     ┆ str     │
    ╞═══════════╪═══════╪═════════╪═════════╪═════════╪═════════╪═════════╡
    │ AAABBCCCC ┆ AAB   ┆ AAA     ┆ BB      ┆ CCCC    ┆ AA      ┆ B       │
    │ DDDEEFFFF ┆ DDE   ┆ DDD     ┆ EE      ┆ FFFF    ┆ DD      ┆ E       │
    └───────────┴───────┴─────────┴─────────┴─────────┴─────────┴─────────┘
    
    

    This alone should provide an enormous speed-up.

    Minor: Generating the list of expression.

    You can experiment with ways to generate the list of expressions more rapidly. The following may or may not speed up the creation of your expressions (rather than transposing df_offsets). You can certainly try it.

    (
        df_strs
        .lazy()
        .with_columns(
            eval(expr) for expr in
            df_offsets.select(
                pl.format(
                    r"pl.col('{}').str.slice({}, {}).alias('{}')",
                    *df_offsets.columns
                )
            )
            .to_series()
        )
        .collect()
    )
    

    The trade-off is that we're generating the expressions inside Polars (as strings) without transposing a DataFrame ... but at the expense of using Python's eval expression to convert the strings to Polars Expression.