Create 'children' rows in panda dataframe, based on unique values in a list of dictionaries

Take the example:

pd.DataFrame(
            index=["r1", "r2"],
            columns=["c1","c2","c3", "group_by"],
            data=[
                ["v1",[{"x_title":"xt1","x_label":"xl1","y_title":"yt1","y_label":"yl1"},
                        {"x_title":"xt1","x_label":"xl1","y_title":"yt2","y_label":"yl2"}, 
                        {"x_title":"xt2","x_label":"xl2","y_title":"yt3","y_label":"yl3"}],
                        "v3","x"],
                ["v1",[{"x_title":"xt1","x_label":"xl1","y_title":"yt1","y_label":"yl1"},
                        {"x_title":"xt2","x_label":"xl2","y_title":"yt2","y_label":"yl2"}, 
                        {"x_title":"xt3","x_label":"xl3","y_title":"yt3","y_label":"yl3"}],
                        "v3","y"],
            ]
        )

Which produces a dataframe like:

    c1                                                 c2  c3 group_by
r1  v1  [{"x_title": "xt1", "x_label": "xl1", "y_title...  v3        x
r2  v1  [{"x_title": "xt1", "x_label": "xl1", "y_title...  v3        y

The main point of conversation will be c2, which is a list of dictionaries. I would like to "pull out" each unique value for the k,v pair where k ~= f"{parent.group_by}_title". So if group_by on the row == x, each unique x_title will have a "child" row generated that keeps all the remaining values.

The end dataframe will hopefully look similar to:

pd.DataFrame(
            index=["r1","r2","r1","r1","r2","r2","r2"],
            columns=["group","orig_name","c2","c3", "type"],
            data=[
                [None,"v1",None,"v3","parent"],
                [None,"v1",None,"v3","parent"],
                ["xt1","v1",[{"y_title":"yt1","y_label":"yl1"},{"y_title":"yt2","y_label":"yl2"}],"v3","child"],
                ["xt2","v1",[{"y_title":"yt3","y_label":"yl3"}],"v3","child"],
                ["yt1","v1",[{"x_title":"xt1","x_label":"xl1"}],"v3","child"],
                ["yt1","v1",[{"x_title":"xt2","x_label":"xl2"}],"v3","child"],
                ["yt1","v1",[{"x_title":"xt3","x_label":"xl3"}],"v3","child"],
            ]
        )

or like:

   group orig_name                                                 c2  c3  type
r1  None        v1                                               None  v3  r1  parent
r2  None        v1                                               None  v3  r2  parent
r1   xt1        v1  [{"y_title": "yt1", "y_label": "yl1"}, {"y_tit...  v3  r1   child
r1   xt2        v1             [{"y_title": "yt3", "y_label": "yl3"}]  v3  r1   child
r2   yt1        v1             [{"x_title": "xt1", "x_label": "xl1"}]  v3  r2   child
r2   yt1        v1             [{"x_title": "xt2", "x_label": "xl2"}]  v3  r2   child
r2   yt1        v1             [{"x_title": "xt3", "x_label": "xl3"}]  v3  r2   child

I'm able to do this "manually" but iterating through each row and going from there. But I would like to have a more "pandas" answer if possible. I've dabbled in ~apply(lambda ~, explode, using a def children(row) function put into a ~apply(lambda ~, and a few other methods. They are didn't quite align to what I'm interested in, and none are quite close enough to provide details.

I'm sure (as I always do) I'm somehow overthinking/looking a simple solution, just not aware of a function that will help, or just no using one of my tries correctly.

Hoping someone could explain a better way of doing this? Thank you!

Solution

There may be a more direct way of approaching this but to me it looks like you could:

turn the nested list/dicts from c2 into rows/columns
perform parent/child grouping
rebuild the c2 column

Step 1:

You can .explode("c2") then pd.json_normalize

df = df.explode("c2")
c2 = pd.json_normalize(df["c2"]).set_index(df.index)
df[c2.columns] = c2
df["c2"] = None

I've assigned c2 to None here just to simplify the output.

    c1    c2  c3 group_by x_title x_label y_title y_label
r1  v1  None  v3        x     xt1     xl1     yt1     yl1
r1  v1  None  v3        x     xt1     xl1     yt2     yl2
r1  v1  None  v3        x     xt2     xl2     yt3     yl3
r2  v1  None  v3        y     xt1     xl1     yt1     yl1
r2  v1  None  v3        y     xt2     xl2     yt2     yl2
r2  v1  None  v3        y     xt3     xl3     yt3     yl3

Step 2:

.columns.get_indexer can map column names to indexes.

>>> df["group_by"] + "_title"
r1    x_title
r1    x_title
r1    x_title
r2    y_title
r2    y_title
r2    y_title
Name: group_by, dtype: object

>>> df.columns.get_indexer(df["group_by"] + "_title")
array([4, 4, 4, 6, 6, 6])

The indexes allow you to extract from a different column on a per-row basis: How can I select a specific column from each row in a Pandas DataFrame?

This can be used to create group/key/value columns:

df["key_name"]   = df["group_by"].map({"x": "y", "y": "x"}) + "_title"
df["value_name"] = df["group_by"].map({"x": "y", "y": "x"}) + "_label"

rows = range(len(df.index))

df["group"] = df.values[rows, df.columns.get_indexer(df["group_by"] + "_title")]
df["key"]   = df.values[rows, df.columns.get_indexer(df["key_name"])]
df["value"] = df.values[rows, df.columns.get_indexer(df["value_name"])]

    c1    c2  c3 group_by x_title x_label y_title y_label key_name value_name group  key value
r1  v1  None  v3        x     xt1     xl1     yt1     yl1  y_title    y_label   xt1  yt1   yl1
r1  v1  None  v3        x     xt1     xl1     yt2     yl2  y_title    y_label   xt1  yt2   yl2
r1  v1  None  v3        x     xt2     xl2     yt3     yl3  y_title    y_label   xt2  yt3   yl3
r2  v1  None  v3        y     xt1     xl1     yt1     yl1  x_title    x_label   yt1  xt1   xl1
r2  v1  None  v3        y     xt2     xl2     yt2     yl2  x_title    x_label   yt2  xt2   xl2
r2  v1  None  v3        y     xt3     xl3     yt3     yl3  x_title    x_label   yt3  xt3   xl3

Step 3:

The parent rows:

>>> df.groupby(df.index).first()
    c1    c2  c3 group_by x_title x_label y_title y_label key_name value_name group  key value
r1  v1  None  v3        x     xt1     xl1     yt1     yl1  y_title    y_label   xt1  yt1   yl1
r2  v1  None  v3        y     xt1     xl1     yt1     yl1  x_title    x_label   yt1  xt1   xl1

The child rows:

I think there may be a smarter way to turn the columns back into dicts - but for now:

df["c2"] = [ 
   { row[f"{col}_name"]: row[col] for col in ("key", "value") }   
   for row in df[["key_name", "value_name", "key", "value"]].reset_index(drop=True).to_dict("records")
]

>>> df["c2"]
r1    {'y_title': 'yt1', 'y_label': 'yl1'}
r1    {'y_title': 'yt2', 'y_label': 'yl2'}
r1    {'y_title': 'yt3', 'y_label': 'yl3'}
r2    {'x_title': 'xt1', 'x_label': 'xl1'}
r2    {'x_title': 'xt2', 'x_label': 'xl2'}
r2    {'x_title': 'xt3', 'x_label': 'xl3'}

You can then .groupby("group") and .agg()

For .agg() you want to use list for the c2 column and first for every other column.

One way to write this is to map all column names to the default action you want and merge a dict of "overrides":

>>> columns = "a", "b", "c"
>>> dict.fromkeys(columns, "first") 
{'a': 'first', 'b': 'first', 'c': 'first'}
>>> dict.fromkeys(columns, "first") | dict(c="list")
{'a': 'first', 'b': 'first', 'c': 'list'}

On the dataframe:

>>> df.reset_index().groupby("group").agg(dict.fromkeys(df.columns, "first") | dict(c2=list))
       c1                                                 c2  c3 group_by x_title x_label y_title y_label group key_name value_name  key value
group                                                                                                                                         
xt1    v1  [{'y_title': 'yt1', 'y_label': 'yl1'}, {'y_tit...  v3        x     xt1     xl1     yt1     yl1   xt1  y_title    y_label  yt1   yl1
xt2    v1             [{'y_title': 'yt3', 'y_label': 'yl3'}]  v3        x     xt2     xl2     yt3     yl3   xt2  y_title    y_label  yt3   yl3
yt1    v1             [{'x_title': 'xt1', 'x_label': 'xl1'}]  v3        y     xt1     xl1     yt1     yl1   yt1  x_title    x_label  xt1   xl1
yt2    v1             [{'x_title': 'xt2', 'x_label': 'xl2'}]  v3        y     xt2     xl2     yt2     yl2   yt2  x_title    x_label  xt2   xl2
yt3    v1             [{'x_title': 'xt3', 'x_label': 'xl3'}]  v3        y     xt3     xl3     yt3     yl3   yt3  x_title    x_label  xt3   xl3

You could then remove any columns that are no longer needed - add a parent/child identifer column to the parent/child rows and .concat them.