In pandas, the following code will split the string from col1 into many columns. is there a way to do this in polars?
data = {"col1": ["a/b/c/d", "a/b/c/d"]}
df = pl.DataFrame(data)
df_pd = df.to_pandas()
df_pd[["a", "b", "c", "d"]] = df_pd["col1"].str.split("/", expand=True)
pl.from_pandas(df_pd)
shape: (2, 5)
┌─────────┬─────┬─────┬─────┬─────┐
│ col1 ┆ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ str │
╞═════════╪═════╪═════╪═════╪═════╡
│ a/b/c/d ┆ a ┆ b ┆ c ┆ d │
│ a/b/c/d ┆ a ┆ b ┆ c ┆ d │
└─────────┴─────┴─────┴─────┴─────┘
Here's an algorithm that will automatically adjust for the required number of columns -- and should be quite performant.
Let's start with this data. Notice that I've purposely added the empty string ""
and a null value - to show how the algorithm handles these values. Also, the number of split strings varies widely.
import polars as pl
df = pl.DataFrame(
{
"my_str": ["cat", "cat/dog", None, "", "cat/dog/aardvark/mouse/frog"],
}
)
df
shape: (5, 1)
┌─────────────────────────────┐
│ my_str │
│ --- │
│ str │
╞═════════════════════════════╡
│ cat │
│ cat/dog │
│ null │
│ │
│ cat/dog/aardvark/mouse/frog │
└─────────────────────────────┘
The algorithm below may be a bit more than you need, but you can edit/delete/add as you need.
(
df
.with_row_index('id')
.with_columns(pl.col('my_str').str.split('/').alias('split_str'))
.explode('split_str')
.with_columns(
('string_' + pl.int_range(pl.len()).cast(pl.String).str.zfill(2))
.over('id')
.alias('col_nm')
)
.pivot(
on='col_nm',
index=['id', 'my_str']
)
.with_columns(
pl.col('^string_.*$').fill_null('')
)
)
shape: (5, 7)
┌─────┬─────────────────────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ id ┆ my_str ┆ string_00 ┆ string_01 ┆ string_02 ┆ string_03 ┆ string_04 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞═════╪═════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 0 ┆ cat ┆ cat ┆ ┆ ┆ ┆ │
│ 1 ┆ cat/dog ┆ cat ┆ dog ┆ ┆ ┆ │
│ 2 ┆ null ┆ ┆ ┆ ┆ ┆ │
│ 3 ┆ ┆ ┆ ┆ ┆ ┆ │
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ cat ┆ dog ┆ aardvark ┆ mouse ┆ frog │
└─────┴─────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
We first assign a row number id
(which we'll need later), and use split
to separate the strings. Note that the split strings form a list.
(
df
.with_row_index("id")
.with_columns(pl.col("my_str").str.split("/").alias("split_str"))
)
shape: (5, 3)
┌─────┬─────────────────────────────┬─────────────────────────────────┐
│ id ┆ my_str ┆ split_str │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ list[str] │
╞═════╪═════════════════════════════╪═════════════════════════════════╡
│ 0 ┆ cat ┆ ["cat"] │
│ 1 ┆ cat/dog ┆ ["cat", "dog"] │
│ 2 ┆ null ┆ null │
│ 3 ┆ ┆ [""] │
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ ["cat", "dog", "aardvark", "mo… │
└─────┴─────────────────────────────┴─────────────────────────────────┘
Next, we'll use explode
to put each string on its own row. (Notice how the id
column tracks the original row that each string came from.)
(
df
.with_row_index("id")
.with_columns(pl.col("my_str").str.split("/").alias("split_str"))
.explode("split_str")
)
shape: (10, 3)
┌─────┬─────────────────────────────┬───────────┐
│ id ┆ my_str ┆ split_str │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str │
╞═════╪═════════════════════════════╪═══════════╡
│ 0 ┆ cat ┆ cat │
│ 1 ┆ cat/dog ┆ cat │
│ 1 ┆ cat/dog ┆ dog │
│ 2 ┆ null ┆ null │
│ 3 ┆ ┆ │
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ cat │
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ dog │
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ aardvark │
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ mouse │
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ frog │
└─────┴─────────────────────────────┴───────────┘
In the next step, we're going to generate our column names. I chose to call each column string_XX
where XX
is the offset with regards to the original string.
I've used the handy zfill
expression so that 1
becomes 01
. (This makes sure that string_02
comes before string_10
if you decide to sort your columns later.)
You can substitute your own naming in this step as you need.
(
df
.with_row_index("id")
.with_columns(pl.col("my_str").str.split("/").alias("split_str"))
.explode("split_str")
.with_columns(
("string_" + pl.int_range(pl.len()).cast(pl.String).str.zfill(2))
.over("id")
.alias("col_nm")
)
)
shape: (10, 4)
┌─────┬─────────────────────────────┬───────────┬───────────┐
│ id ┆ my_str ┆ split_str ┆ col_nm │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str │
╞═════╪═════════════════════════════╪═══════════╪═══════════╡
│ 0 ┆ cat ┆ cat ┆ string_00 │
│ 1 ┆ cat/dog ┆ cat ┆ string_00 │
│ 1 ┆ cat/dog ┆ dog ┆ string_01 │
│ 2 ┆ null ┆ null ┆ string_00 │
│ 3 ┆ ┆ ┆ string_00 │
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ cat ┆ string_00 │
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ dog ┆ string_01 │
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ aardvark ┆ string_02 │
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ mouse ┆ string_03 │
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ frog ┆ string_04 │
└─────┴─────────────────────────────┴───────────┴───────────┘
In the next step, we'll use the pivot
function to place each string in its own column.
(
df
.with_row_index("id")
.with_columns(pl.col("my_str").str.split("/").alias("split_str"))
.explode("split_str")
.with_columns(
("string_" + pl.int_range(pl.len()).cast(pl.String).str.zfill(2))
.over("id")
.alias("col_nm")
)
.pivot(
on="col_nm",
index=["id", "my_str"]
)
)
shape: (5, 7)
┌─────┬─────────────────────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ id ┆ my_str ┆ string_00 ┆ string_01 ┆ string_02 ┆ string_03 ┆ string_04 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞═════╪═════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 0 ┆ cat ┆ cat ┆ null ┆ null ┆ null ┆ null │
│ 1 ┆ cat/dog ┆ cat ┆ dog ┆ null ┆ null ┆ null │
│ 2 ┆ null ┆ null ┆ null ┆ null ┆ null ┆ null │
│ 3 ┆ ┆ ┆ null ┆ null ┆ null ┆ null │
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ cat ┆ dog ┆ aardvark ┆ mouse ┆ frog │
└─────┴─────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
All that remains is to use fill_null
to replace the null
values with an empty string ""
. Notice that I've used a regex expression in the col
expression to target only those columns whose names start with "string_". (Depending on your other data, you may not want to replace null with ""
everywhere in your data.)