Search code examples
python-polars

Idiomatic replacement of empty string '' with pl.Null (null) in polars


I have a polars DataFrame with a number of Series that look like:

pl.Series(['cow', 'cat', '', 'lobster', ''])

and I'd like them to be

pl.Series(['cow', 'cat', pl.Null, 'lobster', pl.Null])

A simple string replacement won't work since pl.Null is not of type PyString:

pl.Series(['cow', 'cat', '', 'lobster', '']).str.replace('', pl.Null)

What's the idiomatic way of doing this for a Series/DataFrame in polars?


Solution

  • Series

    For a single Series, you can use the set method.

    s = pl.Series(["cow", "cat", "", "lobster", ""])
    s.set(s.str.len_chars() == 0, None)
    
    shape: (5,)
    Series: '' [str]
    [
            "cow"
            "cat"
            null
            "lobster"
            null
    ]
    

    DataFrame

    For DataFrames, I would suggest using when/then/otherwise. For example, with this data:

    df = pl.DataFrame({
        "str1": ["cow", "dog", "", "lobster", ""],
        "str2": ["", "apple", "orange", "", "kiwi"],
        "str3": ["house", "", "apartment", "condo", ""],
    })
    
    shape: (5, 3)
    ┌─────────┬────────┬───────────┐
    │ str1    ┆ str2   ┆ str3      │
    │ ---     ┆ ---    ┆ ---       │
    │ str     ┆ str    ┆ str       │
    ╞═════════╪════════╪═══════════╡
    │ cow     ┆        ┆ house     │
    │ dog     ┆ apple  ┆           │
    │         ┆ orange ┆ apartment │
    │ lobster ┆        ┆ condo     │
    │         ┆ kiwi   ┆           │
    └─────────┴────────┴───────────┘
    

    We can run a replacement on all string columns as follows:

    df.with_columns(
        pl.when(pl.col(pl.String).str.len_chars() == 0)
        .then(None)
        .otherwise(pl.col(pl.String))
        .name.keep()
    )
    
    shape: (5, 3)
    ┌─────────┬────────┬───────────┐
    │ str1    ┆ str2   ┆ str3      │
    │ ---     ┆ ---    ┆ ---       │
    │ str     ┆ str    ┆ str       │
    ╞═════════╪════════╪═══════════╡
    │ cow     ┆ null   ┆ house     │
    │ dog     ┆ apple  ┆ null      │
    │ null    ┆ orange ┆ apartment │
    │ lobster ┆ null   ┆ condo     │
    │ null    ┆ kiwi   ┆ null      │
    └─────────┴────────┴───────────┘
    

    The above should be fairly performant.

    If you only want to replace empty strings with null on certain columns, you can provide a list:

    only_these = ["str1", "str2"]
    df.with_columns(
        pl.when(pl.col(only_these).str.len_chars() == 0)
        .then(None)
        .otherwise(pl.col(only_these))
        .name.keep()
    )
    
    shape: (5, 3)
    ┌─────────┬────────┬───────────┐
    │ str1    ┆ str2   ┆ str3      │
    │ ---     ┆ ---    ┆ ---       │
    │ str     ┆ str    ┆ str       │
    ╞═════════╪════════╪═══════════╡
    │ cow     ┆ null   ┆ house     │
    │ dog     ┆ apple  ┆           │
    │ null    ┆ orange ┆ apartment │
    │ lobster ┆ null   ┆ condo     │
    │ null    ┆ kiwi   ┆           │
    └─────────┴────────┴───────────┘