Generate hierarchical data from pandas df to list

I have data in this form

data = [
    [2019, "July", 8, '1.2.0', 7.0, None, None, None],
    [2019, "July", 10, '1.2.0', 52.0, "Breaking", 6.0, 'Path Removed w/o Deprecation'],
    [2019, "July", 15, "0.1.0", 210.0, "Breaking", 57.0, 'Request Parameter Removed'],
    [2019, 'August', 20, '2.0.0', 100.0, "Breaking", None, None],
    [2019, 'August', 25, '2.0.0', 200.0, 'Non-breaking', None, None],
]

The list goes in this hierarchy: Year, Month, Day, info_version, API_changes, type1, count, content

I want to generate this hierarchical tree structure for the data:

{
  "name": "2019", # this is year
  "children": [
    {
      "name": "July", # this is month
      "children": [
        {
          "name": "10",   #this is day
          "children": [
            {
              "name": "1.2.0",   # this is info_version
              "value": 52,        # this is value of API_changes(always a number)
              "children": [
                {
                  "name": "Breaking",   # this is type1 column( it is string, it is either Nan or Breaking)
                  "value": 6,                   # this is value of count
                  "children": [
                    {
                      "name": "Path Removed w/o Deprecation",      #this is content column
                      "value": 6        # this is value of count
                    }
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

For all other months it continues in the same format. I do not wish to modify my data in any way; this is how it is supposed to be for my use case (graphical purposes). I am not sure how to achieve this, and any suggestions would be greatly appreciated.

This is in reference to this format for Sunburst graph in pyecharts

Solution

Assuming that the headers are known and listed in hierarchical order, with each level describing which columns are grouped together and how its children should be sorted, like so (see the datetime documentation for the month parsing):

from datetime import datetime
hierarchical_description = [
    ([("name", "Year")], lambda d: int(d["name"])),
    ([("name", "Month")], lambda d: datetime.strptime(d["name"], "%B").month),
    ([("name", "Day")], None),
    ([("name", "info_version"), ("value", "API_changes")], None),
    (
        [
            ("name", "type1"),
            ("value", "count"),
        ],
        None,
    ),
    ([("name", "content"), ("value", "count")], None),
]
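
As a quick illustration of what the sorting keys do, here is a minimal sketch. The `nodes` list and the `month_key` name are made up for this example, and it assumes an English locale, since `%B` parses locale-dependent month names.

```python
from datetime import datetime

# Hypothetical month nodes, in the alphabetical order a groupby would produce.
nodes = [{"name": "August"}, {"name": "July"}]


def month_key(d):
    # Same idea as the lambda in the description: month name -> month number.
    return datetime.strptime(d["name"], "%B").month


ordered = sorted(nodes, key=month_key)
print([n["name"] for n in ordered])  # ['July', 'August']
```

Without such a key, months would come back alphabetically rather than chronologically.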

And assuming that the dataframe is loaded as follows:

import pandas as pd

data = [
    [2019, "July", 8, "1.2.0", 7.0, None, None, None],
    [2019, "July", 10, "1.2.0", 52.0, "Breaking", 6.0, "Path Removed w/o Deprecation"],
    [2019, "July", 15, "0.1.0", 210.0, "Breaking", 57.0, "Request Parameter Removed"],
    [2019, "August", 20, "2.0.0", 100.0, "Breaking", None, None],
    [2019, "August", 25, "2.0.0", 200.0, "Non-breaking", None, None],
]

hierarchical_order = [
    "Year",
    "Month",
    "Day",
    "info_version",
    "API_changes",
    "type1",
    "count",
    "content",
]

df = pd.DataFrame(
    data,
    columns=hierarchical_order,
)

It is then possible to write a recursive function that walks down the hierarchy of the dataframe:

def logical_and_df(df, conditions):
    if len(conditions) == 0:
        return df
    colname, value = conditions[0]
    return logical_and_df(df[df[colname] == value], conditions[1:])


def get_hierarchical_data(df, description):
    if len(description) == 0:
        return []

    children = []
    parent_description, sorting_function_key = description[0]
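    for colvalues, subdf in df.groupby([colname for _, colname in parent_description]):
        # Some pandas versions yield a scalar key when grouping by a single
        # column; normalize to a tuple so that zip() below always works.
        if not isinstance(colvalues, tuple):
            colvalues = (colvalues,)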
        attributes = {
            key: value for (key, _), value in zip(parent_description, colvalues)
        }
        grand_children = get_hierarchical_data(
            logical_and_df(
                subdf,
                [
                    (colname, value)
                    for (_, colname), value in zip(parent_description, colvalues)
                ],
            ),
            description[1:],
        )
        if len(grand_children) > 0:
            attributes["children"] = grand_children

        children.append(attributes)

    if sorting_function_key is None:
        return children
    return sorted(children, key=sorting_function_key)

The function logical_and_df takes a dataframe and a list of conditions. Each condition is a pair: the first member is the column name and the second is the value required in that column. Only the rows that satisfy every condition are kept.
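
To make the filtering concrete, here is a small sketch of the same idea written as a loop. The name `filter_rows` and the `sample` frame are hypothetical and only serve the illustration.

```python
import pandas as pd


def filter_rows(df, conditions):
    # Iterative equivalent of the recursive filter (hypothetical name):
    # keep only the rows matching every (column, value) pair.
    for colname, value in conditions:
        df = df[df[colname] == value]
    return df


sample = pd.DataFrame({"Month": ["July", "July", "August"], "Day": [8, 10, 20]})
result = filter_rows(sample, [("Month", "July"), ("Day", 10)])
print(result)
```

Only the row for July 10 survives both conditions. An empty list of conditions returns the dataframe unchanged.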

The recursive function get_hierarchical_data takes the dataframe and the hierarchical description as input. The description is a list of tuples. Each tuple is composed of a list that pairs the output keys (name, value) with column names, and an optional sorting key function that is used to order the list of children. The function returns the children whose name and value are taken from the first element of the description. If the description is empty, it returns an empty list of children. Otherwise, it uses the groupby method from pandas to find the unique combinations of values (see this post). For each combination, a dictionary of name and value attributes is created and extended with the children returned by the recursive call.
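
The groupby behaviour the recursion relies on can be checked on its own. The sketch below uses a made-up `sample` frame and assumes the default `dropna=True` behaviour of groupby, under which rows with a missing key are left out of every group.

```python
import pandas as pd

# Made-up rows mimicking the type1 / count columns, including missing values.
sample = pd.DataFrame(
    {
        "type1": ["Breaking", "Breaking", None],
        "count": [6.0, None, None],
    }
)

# Grouping by a list of two columns yields one tuple key per unique pair.
keys = [key for key, _ in sample.groupby(["type1", "count"])]
print(keys)
```

Only the complete pair ("Breaking", 6.0) forms a group. This is why rows whose type1 or count is missing produce no children at that level.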

The following lines print the resulting structure:

import json
print(json.dumps(get_hierarchical_data(df, hierarchical_description), indent=5))
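
Note that the names for Year and Day come out as numbers, whereas the structure in the question shows them as strings. If strings are needed, a small post-processing step along these lines could be applied to the result. The helper name `stringify_names` and the `tree` value are hypothetical, and this is a sketch rather than part of the solution above.

```python
def stringify_names(nodes):
    # Hypothetical helper: return a copy of the tree where every "name"
    # is converted to str; "value" and other keys are left untouched.
    result = []
    for node in nodes:
        copy = dict(node)
        copy["name"] = str(node["name"])
        if "children" in node:
            copy["children"] = stringify_names(node["children"])
        result.append(copy)
    return result


# Tiny hand-written tree in the same shape as the function's output.
tree = [{"name": 2019, "children": [{"name": "July", "children": [{"name": 10}]}]}]
converted = stringify_names(tree)
print(converted)
```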

First posted version

My first version was not specific to the problem of grouped columns. I edited this post with the new version above, which should solve your issue.