Unfortunately, I have to work with some nested data in a polars dataframe (I know it is bad practice). Consider this data:
data = {
    "positions": [
        {
            "company": {
                "companyName": "name1"
            },
        },
        {
            "company": {
                "companyName": "name2"
            },
        },
        {
            "company": {
                "companyName": "name3"
            },
        }
    ]
}
positions is a column in the dataframe. I have explored the polars Python API docs but cannot figure out how to extract just the companyName fields into a separate list column.
I want to achieve the same result as this comprehension:
names = (
    [
        p["company"]["companyName"]
        for p in data["positions"]
        if p.get("company") and p.get("company").get("companyName")
    ]
    if data.get("positions")
    else None
)
Note the null checks.
I get the sense that I have to use the .list.eval() function along with pl.element(), but I am a bit foggy on the API.
Before:
shape: (3, 1)
┌─────────────┐
│ positions   │
│ ---         │
│ struct[1]   │
╞═════════════╡
│ {{"name1"}} │
│ {{"name2"}} │
│ {{"name3"}} │
└─────────────┘
After:
shape: (3, 1)
┌───────┐
│ names │
│ ---   │
│ str   │
╞═══════╡
│ name1 │
│ name2 │
│ name3 │
└───────┘
You can use the .struct.field() or .struct[] syntax to extract struct fields.
import polars as pl

df = pl.DataFrame(data)
df.with_columns(
    pl.col("positions").struct["company"].struct["companyName"]
)
shape: (3, 2)
┌─────────────┬─────────────┐
│ positions   ┆ companyName │
│ ---         ┆ ---         │
│ struct[1]   ┆ str         │
╞═════════════╪═════════════╡
│ {{"name1"}} ┆ name1 │
│ {{"name2"}} ┆ name2 │
│ {{"name3"}} ┆ name3 │
└─────────────┴─────────────┘
Alternatively, you can work at the frame-level and .unnest() the structs into columns.
df.unnest("positions").unnest("company")
shape: (3, 1)
┌─────────────┐
│ companyName │
│ ---         │
│ str         │
╞═════════════╡
│ name1 │
│ name2 │
│ name3 │
└─────────────┘
If working with a list of structs, you could use the .list.eval() API:
df = pl.DataFrame([data])
df.with_columns(
    pl.col("positions").list.eval(
        pl.element().struct["company"].struct["companyName"]
    )
)
shape: (1, 1)
┌─────────────────────────────┐
│ positions                   │
│ ---                         │
│ list[str]                   │
╞═════════════════════════════╡
│ ["name1", "name2", "name3"] │
└─────────────────────────────┘
Or at the frame-level using .explode() and .unnest():
df.explode("positions").unnest("positions").unnest("company")
shape: (3, 1)
┌─────────────┐
│ companyName │
│ ---         │
│ str         │
╞═════════════╡
│ name1 │
│ name2 │
│ name3 │
└─────────────┘