TypeError when using regex to change column names in polars


I have a df:

import polars as pl

df = pl.DataFrame({
    "A": [0],
    "B": [1],
    '{"C_C", "1"}': [2],
    '{"D_D", "6"}': [3],
})

I want to rename the columns so that, when a name contains quoted parts, those parts are joined with an underscore and _count is appended at the end, so {"C_C", "1"} becomes C_C_1_count. I have tried:

def flatten_pivot_polars(d:pl.DataFrame, col_str: str)->pl.DataFrame:

  import re

  d=d.select(
        pl.exclude(["Step", "RunId"]).name.map(lambda col_name: 
           '_'.join([re.findall('"([^"]*)"',col_name), col_str]))
        )
  return d

flatten_pivot_polars(df, 'count')

but this gives:

ComputeError: Python function in 'name.map' produced an error: 
TypeError: sequence item 0: expected str instance, list found.

I am guessing this is because I am not excluding the non-quoted columns properly, but I don't know what else to do.


Solution

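  • re.findall returns all non-overlapping matches as a list of strings. Putting that list next to col_str inside another list makes item 0 a list, and str.join accepts only strings, hence the TypeError. To append col_str to the list of matches, concatenate the two lists: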

    re.findall('"([^"]*)"',col_name) + [col_str]
    

    Instead of

    [re.findall('"([^"]*)"',col_name), col_str]
    

    which would end up with a nested list [[matches], col_str].
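
    One more thing to watch for: with the concatenation fix alone, a column with no quoted parts (such as A or B) yields an empty match list, so both would be renamed to just count and are likely to clash. Below is a minimal sketch of a renaming helper that leaves such names unchanged. The name flatten_name and its suffix parameter are made up for illustration and are not part of polars; the helper is plain Python so it can be checked without a DataFrame.

```python
import re


def flatten_name(col_name: str, suffix: str = "count") -> str:
    """Join the double-quoted parts of col_name with '_' and append suffix.

    Names without any quoted part are returned unchanged (an assumption
    about the desired behaviour for columns such as "A" and "B").
    """
    # findall returns a flat list of the captured groups, e.g. ['C_C', '1']
    parts = re.findall(r'"([^"]*)"', col_name)
    if not parts:
        return col_name
    # list + list keeps everything flat, so str.join only sees strings
    return "_".join(parts + [suffix])


names = ["A", "B", '{"C_C", "1"}', '{"D_D", "6"}']
renamed = [flatten_name(n) for n in names]
print(renamed)  # ['A', 'B', 'C_C_1_count', 'D_D_6_count']
```

    Inside the original function this could be passed to name.map, for example pl.all().name.map(lambda c: flatten_name(c, col_str)), assuming the unquoted columns should keep their names.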