
Join differently nested lists in polars columns


As you might have recognized from my other questions, I am currently transitioning from pandas to polars. I have a polars DataFrame whose columns contain lists of different nesting depth, like this:

┌──────────────────────────────────────┬──────────────────────────────────────┬─────────────────┬──────┐
│ col1                                 ┆ col2                                 ┆ col3            ┆ col4 │
│ ---                                  ┆ ---                                  ┆ ---             ┆ ---  │
│ list[list[str]]                      ┆ list[list[str]]                      ┆ list[str]       ┆ str  │
╞══════════════════════════════════════╪══════════════════════════════════════╪═════════════════╪══════╡
│ [["a", "a"], ["b", "b"], ["c", "c"]] ┆ [["a", "a"], ["b", "b"], ["c", "c"]] ┆ ["A", "B", "C"] ┆ 1    │
│ [["a", "a"]]                         ┆ [["a", "a"]]                         ┆ ["A"]           ┆ 2    │
│ [["b", "b"], ["c", "c"]]             ┆ [["b", "b"], ["c", "c"]]             ┆ ["B", "C"]      ┆ 3    │
└──────────────────────────────────────┴──────────────────────────────────────┴─────────────────┴──────┘

Now I want to join the lists from the inside out, using a different separator at each nesting level ("+" for the inner lists, "-" for the outer ones), to get this:

┌─────────────┬─────────────┬───────┬──────┐
│ col1        ┆ col2        ┆ col3  ┆ col4 │
│ ---         ┆ ---         ┆ ---   ┆ ---  │
│ str         ┆ str         ┆ str   ┆ str  │
╞═════════════╪═════════════╪═══════╪══════╡
│ a+a-b+b-c+c ┆ a+a-b+b-c+c ┆ A-B-C ┆ 1    │
│ a+a         ┆ a+a         ┆ A     ┆ 2    │
│ b+b-c+c     ┆ b+b-c+c     ┆ B-C   ┆ 3    │
└─────────────┴─────────────┴───────┴──────┘

I currently do this with map_elements and a for loop, but I suspect that is highly inefficient. Is there a polars-native way to do this?

Here is my code:

import polars as pl

df = pl.DataFrame({"col1": [[["a", "a"], ["b", "b"], ["c", "c"]], [["a", "a"]], [["b", "b"], ["c", "c"]]],
                   "col2": [[["a", "a"], ["b", "b"], ["c", "c"]], [["a", "a"]], [["b", "b"], ["c", "c"]]],
                   "col3": [["A", "B", "C"], ["A"], ["B", "C"]],
                   "col4": ["1", "2", "3"]})

nested_list_cols = ["col1", "col2"]
list_cols = ["col3"]

for col in nested_list_cols:
    df = df.with_columns(pl.lit(df[col].map_elements(lambda listed: ['+'.join(element) for element in listed], return_dtype=pl.List(pl.String))).alias(col)) # is the return_dtype always pl.List(pl.String)?
for col in list_cols + nested_list_cols:
    df = df.with_columns(pl.lit(df[col].list.join(separator='-')).alias(col))

Solution

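  • You could use .list.eval() to join each inner list with "+", then .list.join() to join the resulting list of strings with "-". The singly nested columns only need the .list.join() step: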

    df.with_columns(
        pl.col(nested_list_cols).list.eval(pl.element().list.join("+")).list.join("-"),
        pl.col(list_cols).list.join("-")
    )
    
    shape: (3, 4)
    ┌─────────────┬─────────────┬───────┬──────┐
    │ col1        ┆ col2        ┆ col3  ┆ col4 │
    │ ---         ┆ ---         ┆ ---   ┆ ---  │
    │ str         ┆ str         ┆ str   ┆ str  │
    ╞═════════════╪═════════════╪═══════╪══════╡
    │ a+a-b+b-c+c ┆ a+a-b+b-c+c ┆ A-B-C ┆ 1    │
    │ a+a         ┆ a+a         ┆ A     ┆ 2    │
    │ b+b-c+c     ┆ b+b-c+c     ┆ B-C   ┆ 3    │
    └─────────────┴─────────────┴───────┴──────┘