Search code examples
python-polars

Polars more concise way to replace empty list with null


Question overview

I am extracting data from a newline-delimited JSON file by applying a series of transformations. One of my transformations results in a list of values; for cases where the list is empty, I want the value to be null rather than an empty list. I have code that works, but it seems very convoluted and I'm wondering if there's a simpler way to do this that I am missing.

More detail

In each JSON object of my ndjson file, one of the data elements I'm interested in is an array of telecom nested JSON objects.

{
   ... some data
    "telecom":
        [
            {
                "rank": 1,
                "system": "phone",
                "use": "work",
                "value": "(123) 456-7890"
            },
            {
                "rank": 2,
                "system": "fax",
                "use": "work",
                "value": "(123) 456-7891"
            }
        ]
    ... some other data
}

As part of a larger data extraction operation, I am doing:

df.select(
   expr_first(),
   expr_extract_phone(),
   expr_others()  
)

where expr_first(), expr_extract_phone() and expr_others() return Polars expressions that perform some transformation on various fields of my dataset.

For expr_extract_phone() I want to get a list of phone numbers from telecom as follows:

  • for each nested object in the telecom array, extract value where system=="phone"
  • collect all the individual phone numbers in a list
  • if the list is empty, the value of the column should be null rather than []

I have been able to cobble together something that works:

def expr_extract_phone() -> pl.Expr:
    return pl.col('telecom').list.eval(
            pl.element().filter(pl.element().struct['system'] == 'phone').struct['value']
        ).list.unique().alias('phone_numbers').map_batches(lambda col:
            pl.LazyFrame(col).select(
                pl.when(pl.col('phone_numbers').list.len() > 0)\
                .then(pl.col('phone_numbers'))
            ).collect().get_column('phone_numbers'),
            return_dtype=pl.List(pl.String),
            is_elementwise=True
        )

Getting the list of phone numbers seems straightforward enough, however the entire map_batches portion to replace an empty list [] with a null value seems very convoluted. Is there a simpler way to accomplish what I'm trying to do?

For strings this SO post seems to provide a nice clean way to handle but I can't seem to find an equivalent for a list.


Solution

  • For a fixed column of type pl.List(...), you could simply use an pl.when().then() construct as follows.

    import polars as pl
    
    df = pl.DataFrame({
        "a": [[1, 2], [3], [], [4, 5], [], [6]]
    })
    
    shape: (6, 1)
    ┌───────────┐
    │ a         │
    │ ---       │
    │ list[i64] │
    ╞═══════════╡
    │ [1, 2]    │
    │ [3]       │
    │ []        │
    │ [4, 5]    │
    │ []        │
    │ [6]       │
    └───────────┘
    
    df.with_columns(
        pl.when(pl.col("a").list.len() > 0).then(pl.col("a"))
    )
    
    shape: (6, 1)
    ┌───────────┐
    │ a         │
    │ ---       │
    │ list[i64] │
    ╞═══════════╡
    │ [1, 2]    │
    │ [3]       │
    │ null      │
    │ [4, 5]    │
    │ null      │
    │ [6]       │
    └───────────┘
    

    The only complication in your concrete example stems from the fact that you don't have a fixed column given but need to apply the construct to an expression. This can easily be done, by saving the pl.Expr in a intermediate variable.

    def expr_extract_phone() -> pl.Expr:
        phone_numbers_expr = (
            pl.col('telecom')
            .list.eval(
                pl.element().filter(pl.element().struct['system'] == 'phone')
                .struct['value']
            )
            .list.unique()
            .alias('phone_numbers')
        )
        # reuse expression from before
        return pl.when(phone_numbers_expr.list.len() > 0).then(phone_numbers_expr)
    

    Warning. While polars usually caches repeated sub-expressions, it currently doesn't in cases when the repeated expression is of the form pl.Expr.list.eval. Especially, phone_numbers_expr may be (partly) recomputed.