Trying to figure out how to transform a k-v string that is inside a column where the k-v string is separated by commas, and could contain different keys. The different keys would then be transformed into their own columns, where missing values would contain nulls.
For example,
pl.DataFrame({
"apple": [1, 2, 3],
"data": ["a=b, b=c", "a=y, y=z", "k1=v1, k2=v2"]
})
would look like:
pl.DataFrame({
"apple": [1, 2, 3],
"a": ["b", "y", None],
"b": ["c", None, None],
"y": [None, "z", None],
"k1": [None, None, "v1"],
"k2": [None, None, "v2"],
"data": ["a=b, b=c", "a=y, y=z", "k1=v1, k2=v2"]
})
once transformed. Does anyone know what is the most efficient way to do this (perhaps without pre-processing of the data, if possible?)
You could attempt to reformat it as JSON objects.
df.with_columns(pl.format('{"{}"}',
pl.col("data").str.replace_many({"=": '":"', ", ": '","'})
))
shape: (3, 3)
┌───────┬──────────────┬───────────────────────┐
│ apple ┆ data ┆ literal │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═══════╪══════════════╪═══════════════════════╡
│ 1 ┆ a=b, b=c ┆ {"a":"b","b":"c"} │
│ 2 ┆ a=y, y=z ┆ {"a":"y","y":"z"} │
│ 3 ┆ k1=v1, k2=v2 ┆ {"k1":"v1","k2":"v2"} │
└───────┴──────────────┴───────────────────────┘
Which would allow you to .str.json_decode()
into a struct.
And then .unnest()
into columns.
df.with_columns(
pl.format('{"{}"}',
pl.col("data").str.replace_many({"=": '":"', ", ": '","'})
)
.str.json_decode()
.alias("json")
).unnest("json")
shape: (3, 7)
┌───────┬──────────────┬──────┬──────┬──────┬──────┬──────┐
│ apple ┆ data ┆ a ┆ b ┆ y ┆ k1 ┆ k2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞═══════╪══════════════╪══════╪══════╪══════╪══════╪══════╡
│ 1 ┆ a=b, b=c ┆ b ┆ c ┆ null ┆ null ┆ null │
│ 2 ┆ a=y, y=z ┆ y ┆ null ┆ z ┆ null ┆ null │
│ 3 ┆ k1=v1, k2=v2 ┆ null ┆ null ┆ null ┆ v1 ┆ v2 │
└───────┴──────────────┴──────┴──────┴──────┴──────┴──────┘