Recursive .groupby to achieve a dict of dicts of dicts


I have the following sample dataset as a pandas DataFrame. Each key is unique to its namespace, and each value is unique to its key:

   image  namespace  key         value
0  img1   ns1        organism    human
1  img1   ns1        organ       liver
2  img1   ns2        microscope  confocal
3  img2   ns2        microscope  confocal
4  img2   ns2        technique   widefield
5  img2   ns2        technique   phase-contrast
6  img2   ns4        analysis    segmentation

I am trying to get a dict of dicts of dicts of lists out of it. The ideal outcome would look like:

{"img1":{"ns1":{"organism":["human"],"organ":["liver"]},
"ns2":{"microscope":["confocal"]}},
"img2":{"ns2":{"microscope":["confocal"],"technique":["widefield","phase-contrast"]},
"ns4":{"analysis":["segmentation"]}}}

I am sure I can somehow achieve this by recursive .groupby but I have tried and failed multiple times. Please, can someone more competent point out an obvious answer?


Solution

  • Not being a Pandas wizard, I would simply iterate over the rows using setdefault() to build your nested dictionary. In fact, I might be tempted to bypass pandas altogether.

    import pandas
    df = pandas.DataFrame([
        {"image": "img1", "namespace": "ns1", "key": "organism", "value": "human"},
        {"image": "img1", "namespace": "ns1", "key": "organ", "value": "liver"},
        {"image": "img1", "namespace": "ns2", "key": "microscope", "value": "confocal"},
        {"image": "img2", "namespace": "ns2", "key": "microscope", "value": "confocal"},
        {"image": "img2", "namespace": "ns2", "key": "technique", "value": "widefield"},
        {"image": "img2", "namespace": "ns2", "key": "technique", "value": "phase-contrast"},
        {"image": "img2", "namespace": "ns4", "key": "analysis", "value": "segmentation"},
    ])
    
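    # Build the nested dict row by row: setdefault() returns the existing
    # child for a key, or inserts and returns the given default if missing.
    results = {}
    for _, row in df.iterrows():
        results \
            .setdefault(row["image"], {}) \
            .setdefault(row["namespace"], {}) \
            .setdefault(row["key"], []) \
            .append(row["value"])

    # Pretty-print the nested result.
    import json
    print(json.dumps(results, indent=4))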
    

    That will give you:

    {
        "img1": {
            "ns1": {
                "organism": [
                    "human"
                ],
                "organ": [
                    "liver"
                ]
            },
            "ns2": {
                "microscope": [
                    "confocal"
                ]
            }
        },
        "img2": {
            "ns2": {
                "microscope": [
                    "confocal"
                ],
                "technique": [
                    "widefield",
                    "phase-contrast"
                ]
            },
            "ns4": {
                "analysis": [
                    "segmentation"
                ]
            }
        }
    }
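
If you would rather stay in pandas, something close to the recursive .groupby the question asks about might look like the sketch below. The helper name `nest` and its parameters are made up for illustration and are not part of pandas. It assumes the column names from the sample data and groups by one column per recursion level, collecting the remaining `value` entries into a list at the bottom.

```python
import pandas as pd


def nest(frame, keys, leaf):
    """Hypothetical helper: recursively group `frame` by each column in `keys`.

    Returns nested dicts, one level per grouping column, with the values of
    the `leaf` column collected into a list at the innermost level.
    """
    if not keys:
        # No grouping columns left: return the leaf values as a plain list.
        return frame[leaf].tolist()
    first, rest = keys[0], keys[1:]
    # Group by a single column name; sort=False keeps first-appearance order.
    return {
        name: nest(group, rest, leaf)
        for name, group in frame.groupby(first, sort=False)
    }


df = pd.DataFrame(
    [
        ("img1", "ns1", "organism", "human"),
        ("img1", "ns1", "organ", "liver"),
        ("img1", "ns2", "microscope", "confocal"),
        ("img2", "ns2", "microscope", "confocal"),
        ("img2", "ns2", "technique", "widefield"),
        ("img2", "ns2", "technique", "phase-contrast"),
        ("img2", "ns4", "analysis", "segmentation"),
    ],
    columns=["image", "namespace", "key", "value"],
)

results = nest(df, ["image", "namespace", "key"], "value")
```

This builds the same structure as the setdefault() loop, but it runs one groupby per nesting level and per group, so for large frames the single-pass loop is likely the cheaper option.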