python-polars split string column into many columns by delimiter

In pandas, the following code will split the string from col1 into many columns. is there a way to do this in polars?

data = {"col1": ["a/b/c/d", "a/b/c/d"]}
df = pl.DataFrame(data)

df_pd = df.to_pandas()
df_pd[["a", "b", "c", "d"]] = df_pd["col1"].str.split("/", expand=True)

pl.from_pandas(df_pd)

shape: (2, 5)
┌─────────┬─────┬─────┬─────┬─────┐
│ col1    ┆ a   ┆ b   ┆ c   ┆ d   │
│ ---     ┆ --- ┆ --- ┆ --- ┆ --- │
│ str     ┆ str ┆ str ┆ str ┆ str │
╞═════════╪═════╪═════╪═════╪═════╡
│ a/b/c/d ┆ a   ┆ b   ┆ c   ┆ d   │
│ a/b/c/d ┆ a   ┆ b   ┆ c   ┆ d   │
└─────────┴─────┴─────┴─────┴─────┘

Solution

Here's an algorithm that will automatically adjust for the required number of columns -- and should be quite performant.

Let's start with this data. Notice that I've purposely added the empty string "" and a null value - to show how the algorithm handles these values. Also, the number of split strings varies widely.

import polars as pl
df = pl.DataFrame(
    {
        "my_str": ["cat", "cat/dog", None, "", "cat/dog/aardvark/mouse/frog"],
    }
)
df

shape: (5, 1)
┌─────────────────────────────┐
│ my_str                      │
│ ---                         │
│ str                         │
╞═════════════════════════════╡
│ cat                         │
│ cat/dog                     │
│ null                        │
│                             │
│ cat/dog/aardvark/mouse/frog │
└─────────────────────────────┘

The Algorithm

The algorithm below may be a bit more than you need, but you can edit/delete/add as you need.

(
    df
    .with_row_index('id')
    .with_columns(pl.col('my_str').str.split('/').alias('split_str'))
    .explode('split_str')
    .with_columns(
        ('string_' + pl.int_range(pl.len()).cast(pl.String).str.zfill(2))
        .over('id')
        .alias('col_nm')
    )
    .pivot(
        on='col_nm',
        index=['id', 'my_str']
    )
    .with_columns(
        pl.col('^string_.*$').fill_null('')
    )
)

shape: (5, 7)
┌─────┬─────────────────────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ id  ┆ my_str                      ┆ string_00 ┆ string_01 ┆ string_02 ┆ string_03 ┆ string_04 │
│ --- ┆ ---                         ┆ ---       ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│ u32 ┆ str                         ┆ str       ┆ str       ┆ str       ┆ str       ┆ str       │
╞═════╪═════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 0   ┆ cat                         ┆ cat       ┆           ┆           ┆           ┆           │
│ 1   ┆ cat/dog                     ┆ cat       ┆ dog       ┆           ┆           ┆           │
│ 2   ┆ null                        ┆           ┆           ┆           ┆           ┆           │
│ 3   ┆                             ┆           ┆           ┆           ┆           ┆           │
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ cat       ┆ dog       ┆ aardvark  ┆ mouse     ┆ frog      │
└─────┴─────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘

How it works

We first assign a row number id (which we'll need later), and use split to separate the strings. Note that the split strings form a list.

(
    df
    .with_row_index("id")
    .with_columns(pl.col("my_str").str.split("/").alias("split_str"))
)

shape: (5, 3)
┌─────┬─────────────────────────────┬─────────────────────────────────┐
│ id  ┆ my_str                      ┆ split_str                       │
│ --- ┆ ---                         ┆ ---                             │
│ u32 ┆ str                         ┆ list[str]                       │
╞═════╪═════════════════════════════╪═════════════════════════════════╡
│ 0   ┆ cat                         ┆ ["cat"]                         │
│ 1   ┆ cat/dog                     ┆ ["cat", "dog"]                  │
│ 2   ┆ null                        ┆ null                            │
│ 3   ┆                             ┆ [""]                            │
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ ["cat", "dog", "aardvark", "mo… │
└─────┴─────────────────────────────┴─────────────────────────────────┘

Next, we'll use explode to put each string on its own row. (Notice how the id column tracks the original row that each string came from.)

(
    df
    .with_row_index("id")
    .with_columns(pl.col("my_str").str.split("/").alias("split_str"))
    .explode("split_str")
)

shape: (10, 3)
┌─────┬─────────────────────────────┬───────────┐
│ id  ┆ my_str                      ┆ split_str │
│ --- ┆ ---                         ┆ ---       │
│ u32 ┆ str                         ┆ str       │
╞═════╪═════════════════════════════╪═══════════╡
│ 0   ┆ cat                         ┆ cat       │
│ 1   ┆ cat/dog                     ┆ cat       │
│ 1   ┆ cat/dog                     ┆ dog       │
│ 2   ┆ null                        ┆ null      │
│ 3   ┆                             ┆           │
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ cat       │
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ dog       │
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ aardvark  │
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ mouse     │
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ frog      │
└─────┴─────────────────────────────┴───────────┘

In the next step, we're going to generate our column names. I chose to call each column string_XX where XX is the offset with regards to the original string.

I've used the handy zfill expression so that 1 becomes 01. (This makes sure that string_02 comes before string_10 if you decide to sort your columns later.)

You can substitute your own naming in this step as you need.

(
    df
    .with_row_index("id")
    .with_columns(pl.col("my_str").str.split("/").alias("split_str"))
    .explode("split_str")
    .with_columns(
        ("string_" + pl.int_range(pl.len()).cast(pl.String).str.zfill(2))
        .over("id")
        .alias("col_nm")
    )
)

shape: (10, 4)
┌─────┬─────────────────────────────┬───────────┬───────────┐
│ id  ┆ my_str                      ┆ split_str ┆ col_nm    │
│ --- ┆ ---                         ┆ ---       ┆ ---       │
│ u32 ┆ str                         ┆ str       ┆ str       │
╞═════╪═════════════════════════════╪═══════════╪═══════════╡
│ 0   ┆ cat                         ┆ cat       ┆ string_00 │
│ 1   ┆ cat/dog                     ┆ cat       ┆ string_00 │
│ 1   ┆ cat/dog                     ┆ dog       ┆ string_01 │
│ 2   ┆ null                        ┆ null      ┆ string_00 │
│ 3   ┆                             ┆           ┆ string_00 │
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ cat       ┆ string_00 │
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ dog       ┆ string_01 │
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ aardvark  ┆ string_02 │
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ mouse     ┆ string_03 │
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ frog      ┆ string_04 │
└─────┴─────────────────────────────┴───────────┴───────────┘

In the next step, we'll use the pivot function to place each string in its own column.

(
    df
    .with_row_index("id")
    .with_columns(pl.col("my_str").str.split("/").alias("split_str"))
    .explode("split_str")
    .with_columns(
        ("string_" + pl.int_range(pl.len()).cast(pl.String).str.zfill(2))
        .over("id")
        .alias("col_nm")
    )
    .pivot(
        on="col_nm",
        index=["id", "my_str"]
    )
)

shape: (5, 7)
┌─────┬─────────────────────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ id  ┆ my_str                      ┆ string_00 ┆ string_01 ┆ string_02 ┆ string_03 ┆ string_04 │
│ --- ┆ ---                         ┆ ---       ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│ u32 ┆ str                         ┆ str       ┆ str       ┆ str       ┆ str       ┆ str       │
╞═════╪═════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 0   ┆ cat                         ┆ cat       ┆ null      ┆ null      ┆ null      ┆ null      │
│ 1   ┆ cat/dog                     ┆ cat       ┆ dog       ┆ null      ┆ null      ┆ null      │
│ 2   ┆ null                        ┆ null      ┆ null      ┆ null      ┆ null      ┆ null      │
│ 3   ┆                             ┆           ┆ null      ┆ null      ┆ null      ┆ null      │
│ 4   ┆ cat/dog/aardvark/mouse/frog ┆ cat       ┆ dog       ┆ aardvark  ┆ mouse     ┆ frog      │
└─────┴─────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘

All that remains is to use fill_null to replace the null values with an empty string "". Notice that I've used a regex expression in the col expression to target only those columns whose names start with "string_". (Depending on your other data, you may not want to replace null with "" everywhere in your data.)