Search code examples
pythonpandasdataframelambdaexplode

Create 'children' rows in panda dataframe, based on unique values in a list of dictionaries


Take the example:

pd.DataFrame(
            index=["r1", "r2"],
            columns=["c1","c2","c3", "group_by"],
            data=[
                ["v1",[{"x_title":"xt1","x_label":"xl1","y_title":"yt1","y_label":"yl1"},
                        {"x_title":"xt1","x_label":"xl1","y_title":"yt2","y_label":"yl2"}, 
                        {"x_title":"xt2","x_label":"xl2","y_title":"yt3","y_label":"yl3"}],
                        "v3","x"],
                ["v1",[{"x_title":"xt1","x_label":"xl1","y_title":"yt1","y_label":"yl1"},
                        {"x_title":"xt2","x_label":"xl2","y_title":"yt2","y_label":"yl2"}, 
                        {"x_title":"xt3","x_label":"xl3","y_title":"yt3","y_label":"yl3"}],
                        "v3","y"],
            ]
        )

Which produces a dataframe like:

    c1                                                 c2  c3 group_by
r1  v1  [{"x_title": "xt1", "x_label": "xl1", "y_title...  v3        x
r2  v1  [{"x_title": "xt1", "x_label": "xl1", "y_title...  v3        y

The main point of conversation will be c2, which is a list of dictionaries. I would like to "pull out" each unique value for the k,v pair where k ~= f"{parent.group_by}_title". So if group_by on the row == x, each unique x_title will have a "child" row generated that keeps all the remaining values.

The end dataframe will hopefully look similar to:

pd.DataFrame(
            index=["r1","r2","r1","r1","r2","r2","r2"],
            columns=["group","orig_name","c2","c3", "type"],
            data=[
                [None,"v1",None,"v3","parent"],
                [None,"v1",None,"v3","parent"],
                ["xt1","v1",[{"y_title":"yt1","y_label":"yl1"},{"y_title":"yt2","y_label":"yl2"}],"v3","child"],
                ["xt2","v1",[{"y_title":"yt3","y_label":"yl3"}],"v3","child"],
                ["yt1","v1",[{"x_title":"xt1","x_label":"xl1"}],"v3","child"],
                ["yt1","v1",[{"x_title":"xt2","x_label":"xl2"}],"v3","child"],
                ["yt1","v1",[{"x_title":"xt3","x_label":"xl3"}],"v3","child"],
            ]
        )

or like:

   group orig_name                                                 c2  c3  type
r1  None        v1                                               None  v3  r1  parent
r2  None        v1                                               None  v3  r2  parent
r1   xt1        v1  [{"y_title": "yt1", "y_label": "yl1"}, {"y_tit...  v3  r1   child
r1   xt2        v1             [{"y_title": "yt3", "y_label": "yl3"}]  v3  r1   child
r2   yt1        v1             [{"x_title": "xt1", "x_label": "xl1"}]  v3  r2   child
r2   yt1        v1             [{"x_title": "xt2", "x_label": "xl2"}]  v3  r2   child
r2   yt1        v1             [{"x_title": "xt3", "x_label": "xl3"}]  v3  r2   child

I'm able to do this "manually" but iterating through each row and going from there. But I would like to have a more "pandas" answer if possible. I've dabbled in ~apply(lambda ~, explode, using a def children(row) function put into a ~apply(lambda ~, and a few other methods. They are didn't quite align to what I'm interested in, and none are quite close enough to provide details.

I'm sure (as I always do) I'm somehow overthinking/looking a simple solution, just not aware of a function that will help, or just no using one of my tries correctly.

Hoping someone could explain a better way of doing this? Thank you!


Solution

  • There may be a more direct way of approaching this but to me it looks like you could:

    • turn the nested list/dicts from c2 into rows/columns
    • perform parent/child grouping
    • rebuild the c2 column

    Step 1:

    You can .explode("c2") then pd.json_normalize

    df = df.explode("c2")
    c2 = pd.json_normalize(df["c2"]).set_index(df.index)
    df[c2.columns] = c2
    df["c2"] = None
    

    I've assigned c2 to None here just to simplify the output.

        c1    c2  c3 group_by x_title x_label y_title y_label
    r1  v1  None  v3        x     xt1     xl1     yt1     yl1
    r1  v1  None  v3        x     xt1     xl1     yt2     yl2
    r1  v1  None  v3        x     xt2     xl2     yt3     yl3
    r2  v1  None  v3        y     xt1     xl1     yt1     yl1
    r2  v1  None  v3        y     xt2     xl2     yt2     yl2
    r2  v1  None  v3        y     xt3     xl3     yt3     yl3
    

    Step 2:

    .columns.get_indexer can map column names to indexes.

    >>> df["group_by"] + "_title"
    r1    x_title
    r1    x_title
    r1    x_title
    r2    y_title
    r2    y_title
    r2    y_title
    Name: group_by, dtype: object
    
    >>> df.columns.get_indexer(df["group_by"] + "_title")
    array([4, 4, 4, 6, 6, 6])
    

    The indexes allow you to extract from a different column on a per-row basis: How can I select a specific column from each row in a Pandas DataFrame?

    This can be used to create group/key/value columns:

    df["key_name"]   = df["group_by"].map({"x": "y", "y": "x"}) + "_title"
    df["value_name"] = df["group_by"].map({"x": "y", "y": "x"}) + "_label"
    
    rows = range(len(df.index))
    
    df["group"] = df.values[rows, df.columns.get_indexer(df["group_by"] + "_title")]
    df["key"]   = df.values[rows, df.columns.get_indexer(df["key_name"])]
    df["value"] = df.values[rows, df.columns.get_indexer(df["value_name"])]
    
        c1    c2  c3 group_by x_title x_label y_title y_label key_name value_name group  key value
    r1  v1  None  v3        x     xt1     xl1     yt1     yl1  y_title    y_label   xt1  yt1   yl1
    r1  v1  None  v3        x     xt1     xl1     yt2     yl2  y_title    y_label   xt1  yt2   yl2
    r1  v1  None  v3        x     xt2     xl2     yt3     yl3  y_title    y_label   xt2  yt3   yl3
    r2  v1  None  v3        y     xt1     xl1     yt1     yl1  x_title    x_label   yt1  xt1   xl1
    r2  v1  None  v3        y     xt2     xl2     yt2     yl2  x_title    x_label   yt2  xt2   xl2
    r2  v1  None  v3        y     xt3     xl3     yt3     yl3  x_title    x_label   yt3  xt3   xl3
    

    Step 3:

    The parent rows:

    >>> df.groupby(df.index).first()
        c1    c2  c3 group_by x_title x_label y_title y_label key_name value_name group  key value
    r1  v1  None  v3        x     xt1     xl1     yt1     yl1  y_title    y_label   xt1  yt1   yl1
    r2  v1  None  v3        y     xt1     xl1     yt1     yl1  x_title    x_label   yt1  xt1   xl1
    

    The child rows:

    I think there may be a smarter way to turn the columns back into dicts - but for now:

    df["c2"] = [ 
       { row[f"{col}_name"]: row[col] for col in ("key", "value") }   
       for row in df[["key_name", "value_name", "key", "value"]].reset_index(drop=True).to_dict("records")
    ]
    
    >>> df["c2"]
    r1    {'y_title': 'yt1', 'y_label': 'yl1'}
    r1    {'y_title': 'yt2', 'y_label': 'yl2'}
    r1    {'y_title': 'yt3', 'y_label': 'yl3'}
    r2    {'x_title': 'xt1', 'x_label': 'xl1'}
    r2    {'x_title': 'xt2', 'x_label': 'xl2'}
    r2    {'x_title': 'xt3', 'x_label': 'xl3'}
    

    You can then .groupby("group") and .agg()

    For .agg() you want to use list for the c2 column and first for every other column.

    One way to write this is to map all column names to the default action you want and merge a dict of "overrides":

    >>> columns = "a", "b", "c"
    >>> dict.fromkeys(columns, "first") 
    {'a': 'first', 'b': 'first', 'c': 'first'}
    >>> dict.fromkeys(columns, "first") | dict(c="list")
    {'a': 'first', 'b': 'first', 'c': 'list'}
    

    On the dataframe:

    >>> df.reset_index().groupby("group").agg(dict.fromkeys(df.columns, "first") | dict(c2=list))
           c1                                                 c2  c3 group_by x_title x_label y_title y_label group key_name value_name  key value
    group                                                                                                                                         
    xt1    v1  [{'y_title': 'yt1', 'y_label': 'yl1'}, {'y_tit...  v3        x     xt1     xl1     yt1     yl1   xt1  y_title    y_label  yt1   yl1
    xt2    v1             [{'y_title': 'yt3', 'y_label': 'yl3'}]  v3        x     xt2     xl2     yt3     yl3   xt2  y_title    y_label  yt3   yl3
    yt1    v1             [{'x_title': 'xt1', 'x_label': 'xl1'}]  v3        y     xt1     xl1     yt1     yl1   yt1  x_title    x_label  xt1   xl1
    yt2    v1             [{'x_title': 'xt2', 'x_label': 'xl2'}]  v3        y     xt2     xl2     yt2     yl2   yt2  x_title    x_label  xt2   xl2
    yt3    v1             [{'x_title': 'xt3', 'x_label': 'xl3'}]  v3        y     xt3     xl3     yt3     yl3   yt3  x_title    x_label  xt3   xl3
    

    You could then remove any columns that are no longer needed - add a parent/child identifer column to the parent/child rows and .concat them.