Python Polars - how to replace strings in a df column with lists with values from dictionary?


This is a follow-up to a question that was previously answered.

I have a large dataframe df that looks like this (the 'SKU' column holds lists):

| SKU                                                                  | Count | Percent     |
|----------------------------------------------------------------------|-------|-------------|
| "('000000009100000749',)"                                            | 110   | 0.029633621 |
| "('000000009100000749', '000000009100000776')"                       | 1     | 0.000269397 |
| "('000000009100000749', '000000009100000776', '000000009100002260')" | 1     | 0.000269397 |
| "('000000009100000749', '000000009100000777', '000000009100002260')" | 1     | 0.000269397 |
| "('000000009100000749', '000000009100000777', '000000009100002530')" | 1     | 0.000269397 |

Need to replace the values in the 'SKU' column with corresponding values from a dictionary df_unique that looks like this (please ignore format below, it is a dict):

skus str code i64
000000009100000749 1
000000009100000785 2
000000009100002088 3

I have tried this code:

replacements = pl.col("SKU")

for old, new in df_unique.items():
    replacements = replacements.str.replace_all(old, new)
df = df.select(replacements)

I get this error: SchemaError: Series of dtype: List(Utf8) != Utf8

I have tried to change the column values to strings, although I think it is redundant, but I get the same error:

df= df.with_column(
    pl.col('SKU').apply(lambda row: [str(x) for x in row])
    )

Any guidance on what I am doing wrong?


Solution

  • It would help if you showed the actual list type of the column:

    It looks like you have "stringified" tuples but it's not entirely clear.

    import polars as pl

    df = pl.DataFrame({
       "SKU": [["000000009100000749"], ["000000009100000749", "000000009100000776"]]
    })
    
    sku_to_code = {
        "000000009100000749": 1,
        "000000009100000785": 2,
        "000000009100002088": 3
    }
    
    >>> df
    shape: (2, 1)
    ┌─────────────────────────────────────┐
    │ SKU                                 │
    │ ---                                 │
    │ list[str]                           │
    ╞═════════════════════════════════════╡
    │ ["000000009100000749"]              │
    │ ["000000009100000749", "00000000... │
    └─────────────────────────────────────┘
    

    .list.eval() allows you to run expressions on lists.

    pl.element() refers to each element of the list inside list.eval, so string expressions can be applied to the individual SKUs.

    replace_sku = pl.element()
    for old, new in sku_to_code.items():
        replace_sku = replace_sku.str.replace_all(old, str(new), literal=True)
    
    df.select(pl.col("SKU").list.eval(replace_sku))
    
    shape: (2, 1)
    ┌─────────────────────────────┐
    │ SKU                         │
    │ ---                         │
    │ list[str]                   │
    ╞═════════════════════════════╡
    │ ["1"]                       │
    │ ["1", "000000009100000776"] │
    └─────────────────────────────┘
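
If the column really does hold "stringified" tuples rather than lists, one possible approach is to parse each cell in plain Python before building the frame. The following is a minimal sketch, not a definitive implementation. It assumes each cell is a string such as `('000000009100000749',)` with no extra outer quotes, and the names `raw_skus` and `parse_and_map` are hypothetical helpers introduced here for illustration. Unlike the chained `replace_all` calls, it looks up each SKU as a whole key rather than as a substring.

```python
import ast

# Mapping from the question (codes assumed to be integers).
sku_to_code = {
    "000000009100000749": 1,
    "000000009100000785": 2,
    "000000009100002088": 3,
}

# Hypothetical raw cells: stringified tuples, as the question's table suggests.
raw_skus = [
    "('000000009100000749',)",
    "('000000009100000749', '000000009100000776')",
]


def parse_and_map(cell, mapping):
    """Parse one stringified tuple and swap each SKU for its code, if it has one."""
    skus = ast.literal_eval(cell)  # evaluates the literal tuple without running code
    # SKUs missing from the mapping are kept unchanged; codes are cast to str
    # so every list contains a single type.
    return [str(mapping.get(sku, sku)) for sku in skus]


mapped = [parse_and_map(cell, sku_to_code) for cell in raw_skus]
print(mapped)  # [['1'], ['1', '000000009100000776']]
```

The resulting list of lists can then be passed to `pl.DataFrame({"SKU": mapped})` to get a `list[str]` column like the one shown above.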