Tags: python, python-polars

How do I filter rows in which a list contains null in a polars DataFrame?


I want to read a csv file in which some columns are strings of this type: "[1.0, 0.0, -2.3]".

So I want to parse these strings as lists of floats; to do so I use this code:

import polars as pl

df = pl.scan_csv(
    "foo.csv",
).with_columns(
    pl.col("^list.*$").str.strip("[ ]").str.split(",").cast(pl.List(pl.Float64)),
).collect()

Some of the strings are empty lists: "[]"; and when I parse them I get a list like this one: [null].

I would like to remove all the rows in which any list contains a null element.

This is the code I tried:

df.filter(
    pl.all_horizontal(
        ~pl.col("^list.*$").list.contains(None)
    )
)

But it gives me this error:

InvalidOperationError: `is_in` operation not supported for dtype `list[f64]`

Here is a little example file so you can test the code yourself:

list1,list2,list3
"[]","[0.0, 1.2]","[3.4]"
"[102, 506]","[12, 5.2]","[2.3]"
"[15]","[0.5, 8.2]","[]"

In this example I would like only the second row to be kept in the DataFrame.


Solution

  • I think you want to replace (or, rather, replace_all) the spaces, as your current code loses the 506

    df = pl.read_csv(b"""
    list1,list2,list3
    "[]","[0.0, 1.2]","[3.4]"
    "[102, 506]","[12, 5.2]","[2.3]"
    "[15]","[0.5, 8.2]","[]"
    """)
    

    If you filter first you can avoid the dtype issue:

    (df.filter(pl.all_horizontal(pl.all() != "[]"))
       .with_columns(
          pl.all().str.replace_all(" ", "")
            .str.strip_chars("[]")
            .str.split(",")
            .cast(pl.List(pl.Float64))
       )
    )
    
    shape: (1, 3)
    ┌────────────────┬─────────────┬───────────┐
    │ list1          ┆ list2       ┆ list3     │
    │ ---            ┆ ---         ┆ ---       │
    │ list[f64]      ┆ list[f64]   ┆ list[f64] │
    ╞════════════════╪═════════════╪═══════════╡
    │ [102.0, 506.0] ┆ [12.0, 5.2] ┆ [2.3]     │
    └────────────────┴─────────────┴───────────┘