
How do I fill_null on a struct column?


I am trying to compare two dataframes with dfcompare = (df0 == df1), but nulls are never considered equal (unlike join, there is no option to let nulls match).

My approach with other fields is to fill them in with an "empty value" appropriate to their datatype. What should I use for structs?

import polars as pl

df = pl.DataFrame(
    {
        "int": [1, 2, None],
        "data" : [dict(a=1,b="b"),dict(a=11,b="bb"),None]
    }
)

df.describe()
print(df)

df2 = df.with_columns(pl.col("int").fill_null(0))

df2.describe()
print(df2)

# Both attempts below raise an error.
# Attempt 1: pass a plain dict as the fill value.
try:
    df3 = df2.with_columns(pl.col("data").fill_null(dict(a=0, b="")))
except Exception as e:
    print("try#1", e)


# Attempt 2: wrap the same dict in pl.struct.
try:
    df3 = df2.with_columns(pl.col("data").fill_null(pl.struct(dict(a=0, b=""))))
except Exception as e:
    print("try#2", e)

Output:


shape: (3, 2)
┌──────┬─────────────┐
│ int  ┆ data        │
│ ---  ┆ ---         │
│ i64  ┆ struct[2]   │
╞══════╪═════════════╡
│ 1    ┆ {1,"b"}     │
│ 2    ┆ {11,"bb"}   │
│ null ┆ {null,null} │
└──────┴─────────────┘
shape: (3, 2)
┌─────┬─────────────┐
│ int ┆ data        │
│ --- ┆ ---         │
│ i64 ┆ struct[2]   │
╞═════╪═════════════╡
│ 1   ┆ {1,"b"}     │
│ 2   ┆ {11,"bb"}   │
│ 0   ┆ {null,null} │
└─────┴─────────────┘
try#1 invalid literal value: "{'a': 0, 'b': ''}"
try#2 a

Error originated just after this operation:
DF ["int", "data"]; PROJECT */2 COLUMNS; SELECTION: "None"

My workaround, which is satisfactory, has been to unnest the struct columns instead. This works fine (arguably better, since it allows subfield-by-subfield fills). Still, I remain curious how to build a suitable "struct literal" that can be passed into these kinds of functions.

One can also imagine wanting to add a hardcoded struct column, analogous to df4 = df.with_columns(pl.lit("0").alias("zerocol")) for a scalar.


Solution

  • A struct literal for use with pl.Expr.fill_null can be built with pl.struct by passing each field as a keyword argument wrapped in pl.lit:

    df.with_columns(
        pl.col("data").fill_null(
            pl.struct(a=pl.lit(1), b=pl.lit("MISSING"))
        )
    )
    
    shape: (3, 2)
    ┌──────┬───────────────┐
    │ int  ┆ data          │
    │ ---  ┆ ---           │
    │ i64  ┆ struct[2]     │
    ╞══════╪═══════════════╡
    │ 1    ┆ {1,"b"}       │
    │ 2    ┆ {11,"bb"}     │
    │ null ┆ {1,"MISSING"} │
    └──────┴───────────────┘