Search code examples
pythonpython-polars

python-polars split string column into many columns by delimiter


In pandas, the following code will split the string from col1 into many columns. is there a way to do this in polars?

d = {"col1": ["a/b/c/d", "a/b/c/d"]}
df= pd.DataFrame(data=d)
df[["a","b","c","d"]]=df["col1"].str.split("/",expand=True)
shape: (2, 5)
┌─────────┬─────┬─────┬─────┬─────┐
│ col1    ┆ a   ┆ b   ┆ c   ┆ d   │
│ ---     ┆ --- ┆ --- ┆ --- ┆ --- │
│ str     ┆ str ┆ str ┆ str ┆ str │
╞═════════╪═════╪═════╪═════╪═════╡
│ a/b/c/d ┆ a   ┆ b   ┆ c   ┆ d   │
│ a/b/c/d ┆ a   ┆ b   ┆ c   ┆ d   │
└─────────┴─────┴─────┴─────┴─────┘

Solution

  • Here's an algorithm that will automatically adjust for the required number of columns -- and should be quite performant.

    Let's start with this data. Notice that I've purposely added the empty string "" and a null value - to show how the algorithm handles these values. Also, the number of split strings varies widely.

    import polars as pl
    df = pl.DataFrame(
        {
            "my_str": ["cat", "cat/dog", None, "", "cat/dog/aardvark/mouse/frog"],
        }
    )
    df
    
    shape: (5, 1)
    ┌─────────────────────────────┐
    │ my_str                      │
    │ ---                         │
    │ str                         │
    ╞═════════════════════════════╡
    │ cat                         │
    │ cat/dog                     │
    │ null                        │
    │                             │
    │ cat/dog/aardvark/mouse/frog │
    └─────────────────────────────┘
    

    The Algorithm

    The algorithm below may be a bit more than you need, but you can edit/delete/add as you need.

    (
        df
        .with_row_index('id')
        .with_columns(pl.col('my_str').str.split('/').alias('split_str'))
        .explode('split_str')
        .with_columns(
            ('string_' + pl.int_range(pl.len()).cast(pl.String).str.zfill(2))
            .over('id')
            .alias('col_nm')
        )
        .pivot(
            on='col_nm',
            index=['id', 'my_str']
        )
        .with_columns(
            pl.col('^string_.*$').fill_null('')
        )
    )
    
    shape: (5, 7)
    ┌─────┬─────────────────────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
    │ id  ┆ my_str                      ┆ string_00 ┆ string_01 ┆ string_02 ┆ string_03 ┆ string_04 │
    │ --- ┆ ---                         ┆ ---       ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
    │ u32 ┆ str                         ┆ str       ┆ str       ┆ str       ┆ str       ┆ str       │
    ╞═════╪═════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
    │ 0   ┆ cat                         ┆ cat       ┆           ┆           ┆           ┆           │
    │ 1   ┆ cat/dog                     ┆ cat       ┆ dog       ┆           ┆           ┆           │
    │ 2   ┆ null                        ┆           ┆           ┆           ┆           ┆           │
    │ 3   ┆                             ┆           ┆           ┆           ┆           ┆           │
    │ 4   ┆ cat/dog/aardvark/mouse/frog ┆ cat       ┆ dog       ┆ aardvark  ┆ mouse     ┆ frog      │
    └─────┴─────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
    

    How it works

    We first assign a row number id (which we'll need later), and use split to separate the strings. Note that the split strings form a list.

    (
        df
        .with_row_index("id")
        .with_columns(pl.col("my_str").str.split("/").alias("split_str"))
    )
    
    shape: (5, 3)
    ┌─────┬─────────────────────────────┬─────────────────────────────────┐
    │ id  ┆ my_str                      ┆ split_str                       │
    │ --- ┆ ---                         ┆ ---                             │
    │ u32 ┆ str                         ┆ list[str]                       │
    ╞═════╪═════════════════════════════╪═════════════════════════════════╡
    │ 0   ┆ cat                         ┆ ["cat"]                         │
    │ 1   ┆ cat/dog                     ┆ ["cat", "dog"]                  │
    │ 2   ┆ null                        ┆ null                            │
    │ 3   ┆                             ┆ [""]                            │
    │ 4   ┆ cat/dog/aardvark/mouse/frog ┆ ["cat", "dog", "aardvark", "mo… │
    └─────┴─────────────────────────────┴─────────────────────────────────┘
    

    Next, we'll use explode to put each string on its own row. (Notice how the id column tracks the original row that each string came from.)

    (
        df
        .with_row_index("id")
        .with_columns(pl.col("my_str").str.split("/").alias("split_str"))
        .explode("split_str")
    )
    
    shape: (10, 3)
    ┌─────┬─────────────────────────────┬───────────┐
    │ id  ┆ my_str                      ┆ split_str │
    │ --- ┆ ---                         ┆ ---       │
    │ u32 ┆ str                         ┆ str       │
    ╞═════╪═════════════════════════════╪═══════════╡
    │ 0   ┆ cat                         ┆ cat       │
    │ 1   ┆ cat/dog                     ┆ cat       │
    │ 1   ┆ cat/dog                     ┆ dog       │
    │ 2   ┆ null                        ┆ null      │
    │ 3   ┆                             ┆           │
    │ 4   ┆ cat/dog/aardvark/mouse/frog ┆ cat       │
    │ 4   ┆ cat/dog/aardvark/mouse/frog ┆ dog       │
    │ 4   ┆ cat/dog/aardvark/mouse/frog ┆ aardvark  │
    │ 4   ┆ cat/dog/aardvark/mouse/frog ┆ mouse     │
    │ 4   ┆ cat/dog/aardvark/mouse/frog ┆ frog      │
    └─────┴─────────────────────────────┴───────────┘
    

    In the next step, we're going to generate our column names. I chose to call each column string_XX where XX is the offset with regards to the original string.

    I've used the handy zfill expression so that 1 becomes 01. (This makes sure that string_02 comes before string_10 if you decide to sort your columns later.)

    You can substitute your own naming in this step as you need.

    (
        df
        .with_row_index("id")
        .with_columns(pl.col("my_str").str.split("/").alias("split_str"))
        .explode("split_str")
        .with_columns(
            ("string_" + pl.int_range(pl.len()).cast(pl.String).str.zfill(2))
            .over("id")
            .alias("col_nm")
        )
    )
    
    shape: (10, 4)
    ┌─────┬─────────────────────────────┬───────────┬───────────┐
    │ id  ┆ my_str                      ┆ split_str ┆ col_nm    │
    │ --- ┆ ---                         ┆ ---       ┆ ---       │
    │ u32 ┆ str                         ┆ str       ┆ str       │
    ╞═════╪═════════════════════════════╪═══════════╪═══════════╡
    │ 0   ┆ cat                         ┆ cat       ┆ string_00 │
    │ 1   ┆ cat/dog                     ┆ cat       ┆ string_00 │
    │ 1   ┆ cat/dog                     ┆ dog       ┆ string_01 │
    │ 2   ┆ null                        ┆ null      ┆ string_00 │
    │ 3   ┆                             ┆           ┆ string_00 │
    │ 4   ┆ cat/dog/aardvark/mouse/frog ┆ cat       ┆ string_00 │
    │ 4   ┆ cat/dog/aardvark/mouse/frog ┆ dog       ┆ string_01 │
    │ 4   ┆ cat/dog/aardvark/mouse/frog ┆ aardvark  ┆ string_02 │
    │ 4   ┆ cat/dog/aardvark/mouse/frog ┆ mouse     ┆ string_03 │
    │ 4   ┆ cat/dog/aardvark/mouse/frog ┆ frog      ┆ string_04 │
    └─────┴─────────────────────────────┴───────────┴───────────┘
    

    In the next step, we'll use the pivot function to place each string in its own column.

    (
        df
        .with_row_index("id")
        .with_columns(pl.col("my_str").str.split("/").alias("split_str"))
        .explode("split_str")
        .with_columns(
            ("string_" + pl.int_range(pl.len()).cast(pl.String).str.zfill(2))
            .over("id")
            .alias("col_nm")
        )
        .pivot(
            on="col_nm",
            index=["id", "my_str"]
        )
    )
    
    shape: (5, 7)
    ┌─────┬─────────────────────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
    │ id  ┆ my_str                      ┆ string_00 ┆ string_01 ┆ string_02 ┆ string_03 ┆ string_04 │
    │ --- ┆ ---                         ┆ ---       ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
    │ u32 ┆ str                         ┆ str       ┆ str       ┆ str       ┆ str       ┆ str       │
    ╞═════╪═════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
    │ 0   ┆ cat                         ┆ cat       ┆ null      ┆ null      ┆ null      ┆ null      │
    │ 1   ┆ cat/dog                     ┆ cat       ┆ dog       ┆ null      ┆ null      ┆ null      │
    │ 2   ┆ null                        ┆ null      ┆ null      ┆ null      ┆ null      ┆ null      │
    │ 3   ┆                             ┆           ┆ null      ┆ null      ┆ null      ┆ null      │
    │ 4   ┆ cat/dog/aardvark/mouse/frog ┆ cat       ┆ dog       ┆ aardvark  ┆ mouse     ┆ frog      │
    └─────┴─────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
    

    All that remains is to use fill_null to replace the null values with an empty string "". Notice that I've used a regex expression in the col expression to target only those columns whose names start with "string_". (Depending on your other data, you may not want to replace null with "" everywhere in your data.)