Search code examples
pythonpython-3.xalgorithmdata-structures

Create a hierarchy data structure from list of data


I want to write a python script to convert my list of data into hierarchy data structure. So to explain more, the first data will always be the root node,and each data has its own id, and user_id is the id of the ancestor. So if the data's user_id is others id then the data should be the children node of the others node

This is my original data

[
    {
        "user_id": 1164146,
        "id": 1158720,
        "ancestors": ",8,77169,1164146,",
        <OTHER DATA>
    },
    {
        "user_id": 1158720,
        "id": 1259377,
        "ancestors": ",8,77169,1164146,1158720,",
        <OTHER DATA>
    },
    {
        "user_id": 1259377,
        "id": 1259378,
        "ancestors": ",8,77169,1164146,1158720,1259377,",
        <OTHER DATA>
    },
    {
        "user_id": 1158720,
        "id": 1259379,
        "ancestors": ",8,77169,1164146,1158720,",
        <OTHER DATA>
    }
]

And my result data is

[
    {
        "user_id": 1164146,
        "id": 1158720,
        "ancestors": ",8,77169,1164146,",
        <OTHER DATA>
        "children": [
            {
                "user_id": 1158720,
                "id": 1259377,
                "ancestors": ",8,77169,1164146,1158720,",
                <OTHER DATA>
                "children": [
                    {
                        "user_id": 1259377,
                        "id": 1259378,
                        "ancestors": ",8,77169,1164146,1158720,1259377,",
                        <OTHER DATA>
                        "children": []
                    }
                ]
            },
            {
                "user_id": 1158720,
                "id": 1259379,
                "ancestors": ",8,77169,1164146,1158720,",
                <OTHER DATA>
                "children": []
            }
        ]
    }
]

From this example, the first data will be the root node, the second data (id:1259377) will be the child of first data (id:1158720) because the second node user_id is 1158720 which is the id of the first data. The actual data that I working on has more hierarchy level than this. Thank you for helping

I try to use ChatGPT to generate the code and it gave this code but the result isn't correct

def create_nested_structure(data):
    nested_data = []

    # The first data is the root node
    root = data[0]
    root["children"] = []

    for item in data[1:]:
        user_id = item["user_id"]

        # Find the parent node based on user_id
        parent = next((node for node in nested_data if node["user_id"] == user_id), None)

        if parent:
            if "children" not in parent:
                parent["children"] = []
            parent["children"].append(item)
        else:
            # If parent not found, add to the root node
            root["children"].append(item)

    nested_data.append(root)

    return nested_data

Solution

  • The language model produced code that has a few mistakes:

    • It never checks the "id" attribute, so it could never correctly identify a parent-child relationship. The comparison node["user_id"] == user_id should be node["id"] == user_id

    • The iterator passed to next() iterates an empty collection nested_data, which of course makes no sense, as that means parent will always be assigned None. Only at the end of the function an item is appended to that nested_data collection. The iterator should iterate data, not nested_data.

    • The children attribute is only added when there is at least one child, but your expected output also has that attribute in leaf-nodes.

    • It assumes that the root node must be the first entry in the data input (to fair, you specified this). But we can imagine this is not necessarily so, and that there could even be more than one root (a forest).

    • It looks for a matching parent by iterating the data -- for each node it visits -- which makes the algorithm inefficient for larger inputs. Instead of repeatedly calling next, we should use a dictionary for retrieving nodes by their id.

    Here is how I would suggest doing it:

    def create_nested_structure(data):
        # Define a "children" attribute for all nodes, and key them by id
        nodes = { obj["id"]: {**obj, "children": []} for obj in data }
        # Don't assume that the root is at index 0, nor that there is just one
        roots = []
        for obj in nodes.values():
            # If this node is a child...
            if obj["user_id"] in nodes:
                # ...then look up the parent in the dictionary, and connect them
                nodes[obj["user_id"]]["children"].append(obj)
            else:
                # ...otherwise it is a root
                roots.append(obj)
        return roots
    

    As your input could define a hierarchy that has more than one root, the function returns a list of nodes, not a single node.

    (and some day, a next language model will be fed with more data, including this, and not make that mistake again).