Question overview
I am extracting data from a newline-delimited JSON file by applying a series of transformations. One of my transformations results in a list of values; for cases where the list is empty, I want the value to be null rather than an empty list. I have code that works, but it seems very convoluted and I'm wondering if there's a simpler way to do this that I am missing.
More detail
In each JSON object of my ndjson file, one of the data elements I'm interested in is telecom, an array of nested JSON objects:
{
  ... some data
  "telecom": [
    {
      "rank": 1,
      "system": "phone",
      "use": "work",
      "value": "(123) 456-7890"
    },
    {
      "rank": 2,
      "system": "fax",
      "use": "work",
      "value": "(123) 456-7891"
    }
  ]
  ... some other data
}
As part of a larger data extraction operation, I am doing:
df.select(
    expr_first(),
    expr_extract_phone(),
    expr_others()
)
where expr_first(), expr_extract_phone() and expr_others() return Polars expressions that perform some transformation on various fields of my dataset.
For expr_extract_phone() I want to get a list of phone numbers from telecom as follows:
- loop over the telecom array and extract value where system == "phone"
- when no phone numbers are found, return null rather than []
I have been able to cobble together something that works:
def expr_extract_phone() -> pl.Expr:
    return pl.col('telecom').list.eval(
        pl.element().filter(pl.element().struct['system'] == 'phone').struct['value']
    ).list.unique().alias('phone_numbers').map_batches(
        lambda col:
            pl.LazyFrame(col).select(
                pl.when(pl.col('phone_numbers').list.len() > 0)
                .then(pl.col('phone_numbers'))
            ).collect().get_column('phone_numbers'),
        return_dtype=pl.List(pl.String),
        is_elementwise=True
    )
Getting the list of phone numbers seems straightforward enough; however, the entire map_batches portion to replace an empty list [] with a null value seems very convoluted. Is there a simpler way to accomplish what I'm trying to do?
For strings, this SO post seems to provide a nice clean way to handle it, but I can't seem to find an equivalent for a list.
For a fixed column of type pl.List(...), you could simply use a pl.when().then() construct as follows.
import polars as pl

df = pl.DataFrame({
    "a": [[1, 2], [3], [], [4, 5], [], [6]]
})
shape: (6, 1)
┌───────────┐
│ a │
│ --- │
│ list[i64] │
╞═══════════╡
│ [1, 2] │
│ [3] │
│ [] │
│ [4, 5] │
│ [] │
│ [6] │
└───────────┘
df.with_columns(
    pl.when(pl.col("a").list.len() > 0).then(pl.col("a"))
)
shape: (6, 1)
┌───────────┐
│ a │
│ --- │
│ list[i64] │
╞═══════════╡
│ [1, 2] │
│ [3] │
│ null │
│ [4, 5] │
│ null │
│ [6] │
└───────────┘
The only complication in your concrete example stems from the fact that you don't have a fixed column, but need to apply the construct to an expression. This can easily be done by saving the pl.Expr in an intermediate variable.
def expr_extract_phone() -> pl.Expr:
    phone_numbers_expr = (
        pl.col('telecom')
        .list.eval(
            pl.element().filter(pl.element().struct['system'] == 'phone')
            .struct['value']
        )
        .list.unique()
        .alias('phone_numbers')
    )
    # reuse the expression from before
    return pl.when(phone_numbers_expr.list.len() > 0).then(phone_numbers_expr)
Warning. While Polars usually caches repeated sub-expressions, it currently doesn't when the repeated expression is of the form pl.Expr.list.eval. In particular, phone_numbers_expr may be (partly) recomputed.