Polars more concise way to replace empty list with null

Question overview

I am extracting data from a newline-delimited JSON file by applying a series of transformations. One of my transformations results in a list of values; for cases where the list is empty, I want the value to be null rather than an empty list. I have code that works, but it seems very convoluted and I'm wondering if there's a simpler way to do this that I am missing.

More detail

In each JSON object of my ndjson file, one of the data elements I'm interested in is an array of telecom nested JSON objects.

{
   ... some data
    "telecom":
        [
            {
                "rank": 1,
                "system": "phone",
                "use": "work",
                "value": "(123) 456-7890"
            },
            {
                "rank": 2,
                "system": "fax",
                "use": "work",
                "value": "(123) 456-7891"
            }
        ]
    ... some other data
}

As part of a larger data extraction operation, I am doing:

df.select(
   expr_first(),
   expr_extract_phone(),
   expr_others()  
)

where expr_first(), expr_extract_phone() and expr_others() return Polars expressions that perform some transformation on various fields of my dataset.

For expr_extract_phone() I want to get a list of phone numbers from telecom as follows:

for each nested object in the telecom array, extract value where system=="phone"
collect all the individual phone numbers in a list
if the list is empty, the value of the column should be null rather than []

I have been able to cobble together something that works:

def expr_extract_phone() -> pl.Expr:
    return pl.col('telecom').list.eval(
            pl.element().filter(pl.element().struct['system'] == 'phone').struct['value']
        ).list.unique().alias('phone_numbers').map_batches(lambda col:
            pl.LazyFrame(col).select(
                pl.when(pl.col('phone_numbers').list.len() > 0)\
                .then(pl.col('phone_numbers'))
            ).collect().get_column('phone_numbers'),
            return_dtype=pl.List(pl.String),
            is_elementwise=True
        )

Getting the list of phone numbers seems straightforward enough, however the entire map_batches portion to replace an empty list [] with a null value seems very convoluted. Is there a simpler way to accomplish what I'm trying to do?

For strings this SO post seems to provide a nice clean way to handle but I can't seem to find an equivalent for a list.

Solution

For a fixed column of type pl.List(...), you could simply use an pl.when().then() construct as follows.

import polars as pl

df = pl.DataFrame({
    "a": [[1, 2], [3], [], [4, 5], [], [6]]
})

shape: (6, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ [1, 2]    │
│ [3]       │
│ []        │
│ [4, 5]    │
│ []        │
│ [6]       │
└───────────┘

df.with_columns(
    pl.when(pl.col("a").list.len() > 0).then(pl.col("a"))
)

shape: (6, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ [1, 2]    │
│ [3]       │
│ null      │
│ [4, 5]    │
│ null      │
│ [6]       │
└───────────┘

The only complication in your concrete example stems from the fact that you don't have a fixed column given but need to apply the construct to an expression. This can easily be done, by saving the pl.Expr in a intermediate variable.

def expr_extract_phone() -> pl.Expr:
    phone_numbers_expr = (
        pl.col('telecom')
        .list.eval(
            pl.element().filter(pl.element().struct['system'] == 'phone')
            .struct['value']
        )
        .list.unique()
        .alias('phone_numbers')
    )
    # reuse expression from before
    return pl.when(phone_numbers_expr.list.len() > 0).then(phone_numbers_expr)

Warning. While polars usually caches repeated sub-expressions, it currently doesn't in cases when the repeated expression is of the form pl.Expr.list.eval. Especially, phone_numbers_expr may be (partly) recomputed.