I am working with multiple parquet datasets that were written with nested structs (sometimes multiple levels deep). I need to output a flattened (no structs) schema. Right now the only way I can think to do that is to use for loops to iterate through the columns. Here is a simplified example of the looping approach:
while len([x.name for x in df if x.dtype == pl.Struct]) > 0:
    for col in df:
        if col.dtype == pl.Struct:
            df = df.unnest(col.name)
This works; maybe it is the only way to do it, and if so it would be helpful to know that. But Polars is pretty neat, and I'm wondering if there is a more functional way to do this without all the looping and reassigning the df to itself.
If you have a df like this:
df = pl.DataFrame(
    {"a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
).select(pl.struct("a", "b").alias("ab"), pl.struct("c", "d").alias("cd"), "e")
You can unnest the ab and cd columns at the same time by just doing:
df.unnest('ab', 'cd')
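If you want to confirm no struct columns remain, a quick sanity check might look like this (a sketch, assuming the df defined above):
print(df.unnest('ab', 'cd').columns)  # ['a', 'b', 'c', 'd', 'e']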
If you don't know your column names and types in advance, then you can just use a list comprehension/generator like this:
df.unnest(col_name for col_name, dtype in df.schema.items() if dtype == pl.Struct)
If you have structs inside structs like:
df = (
    pl.DataFrame(
        {"a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
    )
    .select(pl.struct("a", "b").alias("ab"), pl.struct("c", "d").alias("cd"), "e")
    .select(pl.struct("ab", "cd").alias("abcd"), "e")
)
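For reference, the schema of that frame now has one struct nested inside another (the exact repr depends on your Polars version, so treat this as approximate):
print(df.schema)
# roughly: {'abcd': Struct({'ab': Struct({'a': Int64, 'b': Int64}),
#                           'cd': Struct({'c': Int64, 'd': Int64})}),
#          'e': Int64}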
then I don't think you can get away from some kind of while loop, but this might be more concise. It is probably the most concise and intuitive way to unnest nested structs: it repeatedly unnests the frame and then checks whether any struct columns remain, stopping when there aren't any more.
while any(dtype == pl.Struct for dtype in df.dtypes):
    df = df.unnest(col_name for col_name, dtype in df.schema.items() if dtype == pl.Struct)
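Applied to the doubly nested frame above, this takes two passes (abcd unnests to ab and cd, which then unnest to a through d). A quick check afterwards, as a sketch:
print(df.columns)  # ['a', 'b', 'c', 'd', 'e']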
The previous method requires repeatedly reassigning df. While that does NOT mean it repeatedly copies any of the underlying data, it does mean it wouldn't work on lazyframes without materializing. Because the schema contains the entire nested structure, we can use it to generate expressions that are lazyframe friendly. There is also something satisfying about not having to reassign the df, even knowing that doing so doesn't produce copies.
def unnest_all(df):
    exprs = []

    def parse_struct_field(struct_field, base_expr, base_name):
        # Build one expression per leaf field, recursing into nested structs.
        ret_list = []
        for field in struct_field.fields:
            if field.dtype == pl.Struct:
                # Nested struct: recurse, extending the name prefix.
                ret_list.extend(
                    parse_struct_field(
                        field.dtype,
                        base_expr.struct.field(field.name),
                        f"{base_name}_{field.name}",
                    )
                )
            else:
                # Leaf field: extract it as a standalone, prefixed column.
                ret_list.append(
                    base_expr.struct.field(field.name).alias(
                        f"{base_name}_{field.name}"
                    )
                )
        return ret_list

    for col in df.columns:
        if df.schema[col] == pl.Struct:
            exprs.extend(parse_struct_field(df.schema[col], pl.col(col), col))
        else:
            exprs.append(col)
    return df.select(*exprs)
This function works by looping through the schema and creating a list of expressions for each outer column. If the outer column is a struct, it calls the parse_struct_field function, which itself loops through the inner fields of the struct, creating expressions that extract each inner field as a standalone column. When one of those fields is also a struct, it calls itself recursively to deal with any level of nesting. To avoid name conflicts it keeps track of the outer column names and prepends them to each inner column, so you end up with column names like "abcd_ab_a".
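As a quick usage sketch on the doubly nested frame from above (the prefixed column names are what the function should produce for that example):
out = unnest_all(df)
print(out.columns)
# ['abcd_ab_a', 'abcd_ab_b', 'abcd_cd_c', 'abcd_cd_d', 'e']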
This method will fail for lazyframes that produce a struct opaquely. By that I mean methods like .list.to_struct or .str.extract_groups, where Polars can't produce a schema without materializing.