python, python-polars, data-wrangling

Polars (Python): select columns based on dtype pl.List


Hi, I want to select the columns of a Polars DataFrame whose dtype is List.
Selecting by dtype usually works fine, e.g. df.select(pl.col(pl.Utf8)).

However, for the List type this does not seem to work.

MRE

import polars as pl

df = pl.DataFrame({"foo": [[c] for c in 
    ["100CT pen", "pencils 250CT", "what 125CT soever", "this is a thing"]]}
)

df

Output:

foo
list[str]
["100CT pen"]
["pencils 250CT"]
["what 125CT soever"]
["this is a thing"]

df.select(pl.col(pl.List))

Output:

shape: (0, 0)

Solution

  • Unlike with primitive types, you need to specify the type of the items in the List. (For a primitive type, print(df.select(pl.col(pl.Int64))) works as-is in the example below.)

    import polars as pl
    
    df = pl.DataFrame({
        "foo": [[c] for c in 
            ["100CT pen", "pencils 250CT", "what 125CT soever", "this is a thing"]],
        "bar": [1, 2, 3, 4]
        }
    )
    print(df.select(pl.col(pl.List(str))))
    

    I can't seem to find anything that's generic across the types a List can contain. There is a NESTED_DTYPES group, and another answer suggests you might be able to use it in a more "catch-all" manner, but it doesn't seem to work if you want to grab columns of a nested type regardless of the type of data they contain.


    Thanks to @jqurious for pointing out that this appears to be a requested feature in an open ticket. It has an interesting use case for me: the only reason I've recently switched DataFrames back to pandas is that Polars refuses to write List columns to CSV. So I either filter out all such columns by name or, if this is implemented, drop them in one go. I didn't create those columns and I don't want them in the output.