Search code examples
pythonjsonpandastreeanytree

Python script to extract data from an excel and put it in trees


I have a python script that extracts data from an excel, more precisely, data from three columns: Finished Good, Parent Part Code and Material Code. The three columns look like this:

Material Code    Parent Part Code   Finished Good
M1               P1                 F1
M2               P2                 F2
M3               M2                 F2
M4               P3                 F2
.....

Note: Material Code can also be in the Parent Part Code column.

The tree should look like this:

F1
  P1
    M1

F2
  P2
    M2
      M3
  P3
    M4

The Python script is as follows:

import pandas as pd
from anytree import Node, RenderTree
import json

# Read excel
df = pd.read_excel('excelFile.xlsx')

root_dict = {}

for index, row in df.iterrows():
    finished_good = row['Finished Good']
    parent_part_code = row['Parent Part Code']
    material_code = row['Material Code']
    
    if finished_good not in root_dict:
        # If not, add as root
        root = Node(finished_good)
        root_dict[finished_good] = root
    else:
        # If Finished Good is in the tree, get the root
        root = root_dict[finished_good]

    if parent_part_code in [node.name for node in root.descendants]:
        # If parent node already exists in the tree, find it and add children
        parent_node = [node for node in root.descendants if node.name == parent_part_code][0]
        
        if material_code in [node.name for node in root.descendants]:
            # If Material Code already exists in the tree, find its sub-tree and copy it to the current node
            material_node = [node for node in root.descendants if node.name == material_code][0]
            material_node.parent = parent_node
        else:
            # If Material Code does not exist in the tree, create a new sub-tree
            material = Node(material_code, parent=parent_node)
            while material_code in df['Parent Part Code'].values:
                filtered = df[df['Parent Part Code'] == material_code]
                parent_code = filtered.iloc[0]['Parent Part Code']
                material_code = filtered.iloc[0]['Material Code']
                parent_node = material
                material = Node(material_code, parent=parent_node)
    else:
        # If parent node doesn't exist in the tree, create new sub-tree
        parent_node = Node(parent_part_code, parent=root)
        material = Node(material_code, parent=parent_node)
        while material_code in df['Parent Part Code'].values:
            filtered = df[df['Parent Part Code'] == material_code]
            parent_code = filtered.iloc[0]['Parent Part Code']
            material_code = filtered.iloc[0]['Material Code']
            parent_node = material
            material = Node(material_code, parent=parent_node)

# Print the trees
for root in root_dict.values():
    print(RenderTree(root))

# Save the trees
def node_to_dict(node):
 
    return {
        'name': node.name,
        'children': [node_to_dict(child) for child in node.children]
    }

with open('normTrees.json', 'w') as file:
    json.dump({key: node_to_dict(root) for key, root in root_dict.items()}, file)
    print("The trees were successfully saved")

The output results are:

Node('/10020115HU')
├── Node('/10020115HU/V371')
│   ├── Node('/10020115HU/V371/YG10-30300')
│   ├── Node('/10020115HU/V371/VECTC002')
│   │   ├── Node('/10020115HU/V371/VECTC002/YG10-30200')
│   │   ├── Node('/10020115HU/V371/VECTC002/VNCTC002')
│   │   │   ├── Node('/10020115HU/V371/VECTC002/VNCTC002/YG10-30300')
│   │   │   ├── Node('/10020115HU/V371/VECTC002/VNCTC002/SZVIZ')
│   │   │   └── Node('/10020115HU/V371/VECTC002/VNCTC002/RVSZALLPOR')
│   │   └── Node('/10020115HU/V371/VECTC002/RVECTC002')
│   └── Node('/10020115HU/V371/U100KOCS')
│       └── Node('/10020115HU/V371/U100KOCS/YG10-30300')
├── Node('/10020115HU/C00211L0')
│   ├── Node('/10020115HU/C00211L0/V371')
│   │   └── Node('/10020115HU/C00211L0/V371/YG10-30300')
│   ├── Node('/10020115HU/C00211L0/RWINNOVERS')
│   └── Node('/10020115HU/C00211L0/RSZENNYEZETT')
├── Node('/10020115HU/10020115HU')
│   └── Node('/10020115HU/10020115HU/TTK00001HU')
├── Node('/10020115HU/D67AMBR910')
│   └── Node('/10020115HU/D67AMBR910/RWINNOVERS')
└── Node('/10020115HU/D67LTRR910')
    ├── Node('/10020115HU/D67LTRR910/RWINNOVERS')
    └── Node('/10020115HU/D67LTRR910/RSECONDUST')
The trees were successfully saved

The script does not work as expected because code V371 appears as a child node for Finished Good 10020115HU and as a sub-node for code C00211L0.

I would need the script to be modified so that it displays the tree as follows:

Node('/10020115HU')
├── Node('/10020115HU/C00211L0')
│   ├── Node('/10020115HU/C00211L0/V371')
│   │   └── Node('/10020115HU/C00211L0/V371/YG10-30300')
|   |     └── Node('/10020115HU/C00211L0/V371/VECTC002')
|   |   |    └── Node('/10020115HU/C00211L0/V371/VECTC002/YG10-30200')
      |  └── Node('/10020115HU/C00211L0//V371/VECTC002/VNCTC002')
│   │   |    │   ├── Node('/10020115HU/C00211L0/V371/VECTC002/VNCTC002/YG10-30300')
│   │   |    │   ├── Node('/10020115HU/C00211L0/V371/VECTC002/VNCTC002/SZVIZ')
│   │   |    │   └── Node('10020115HU/C00211L0/V371/VECTC002/VNCTC002/RVSZALLPOR')
│   │   |    └── Node('/10020115HU/C00211L0/V371/VECTC002/RVECTC002')
│   |   Node('/10020115HU/C00211L0/V371/U100KOCS')
│   |     └── Node('/10020115HU/C00211L0/V371/U100KOCS/YG10-30300')
│   ├── Node('/10020115HU/C00211L0/RWINNOVERS')
│   └── Node('/10020115HU/C00211L0/RSZENNYEZETT')
├── Node('/10020115HU/10020115HU')
│   └── Node('/10020115HU/10020115HU/TTK00001HU')
├── Node('/10020115HU/D67AMBR910')
│   └── Node('/10020115HU/D67AMBR910/RWINNOVERS')
└── Node('/10020115HU/D67LTRR910')
    ├── Node('/10020115HU/D67LTRR910/RWINNOVERS')
    └── Node('/10020115HU/D67LTRR910/RSECONDUST')

Solution

  • The problem is that your input may list Parent Part Codes that (much) later occur as Material Code ("V371" is in that case), which means they are not a direct child of the root. If this later occurrence includes a Parent Part Code that is new, then a new node is created (unconditionally) for that Material Code, and so in this case you have two nodes with the name "V371" in your tree.

    It can be quite tedious to do this in one pass. I would suggest doing this in two passes:

    1. Create the root node and the nodes for all Parent Part Codes (assuming at first that they are direct children of the root)
    2. Create the nodes for Material Codes (if they don't exist yet as Parent Part Code node). Attach each of them to the node we have for the associated Parent Part Code (from the first phase).

    This looks easier:

    root_dict = {}
    for sweep in range(2):
        for index, row in df.iterrows():
            finished_good = row['Finished Good']
            parent_part_code = row['Parent Part Code']
            material_code = row['Material Code']
        
            if finished_good not in root_dict:
                root_dict[finished_good] = Node(finished_good), {}
            root, nodes = root_dict[finished_good]
            if parent_part_code not in nodes:
                nodes[parent_part_code] = Node(parent_part_code, parent=root)
            if sweep:
                child = nodes[material_code] if material_code in nodes else Node(material_code)
                child.parent = nodes[parent_part_code]
    

    After this has executed, the tree is ready to print. As the above code associates a nodes dictionary for each tree, the values of the root_dict dictionary are now tuples (pairs), so the loop should extract the root from that tuple:

    for root, _ in root_dict.values():
        print(RenderTree(root))
    

    Output:

    Node('/10020115HU')
    ├── Node('/10020115HU/C00211L0')
    │   ├── Node('/10020115HU/C00211L0/V371')
    │   │   ├── Node('/10020115HU/C00211L0/V371/YG10-30300')
    │   │   ├── Node('/10020115HU/C00211L0/V371/VECTC002')
    │   │   │   ├── Node('/10020115HU/C00211L0/V371/VECTC002/YG10-30200')
    │   │   │   ├── Node('/10020115HU/C00211L0/V371/VECTC002/VNCTC002')
    │   │   │   │   ├── Node('/10020115HU/C00211L0/V371/VECTC002/VNCTC002/YG10-30300')
    │   │   │   │   ├── Node('/10020115HU/C00211L0/V371/VECTC002/VNCTC002/SZVIZ')
    │   │   │   │   └── Node('/10020115HU/C00211L0/V371/VECTC002/VNCTC002/RVSZALLPOR')
    │   │   │   └── Node('/10020115HU/C00211L0/V371/VECTC002/RVECTC002')
    │   │   ├── Node('/10020115HU/C00211L0/V371/U100KOCS')
    │   │   │   └── Node('/10020115HU/C00211L0/V371/U100KOCS/YG10-30300')
    │   │   └── Node('/10020115HU/C00211L0/V371/YG10-30300')
    │   ├── Node('/10020115HU/C00211L0/RWINNOVERS')
    │   └── Node('/10020115HU/C00211L0/RSZENNYEZETT')
    ├── Node('/10020115HU/10020115HU')
    │   └── Node('/10020115HU/10020115HU/TTK00001HU')
    ├── Node('/10020115HU/D67AMBR910')
    │   └── Node('/10020115HU/D67AMBR910/RWINNOVERS')
    └── Node('/10020115HU/D67LTRR910')
        ├── Node('/10020115HU/D67LTRR910/RWINNOVERS')
        └── Node('/10020115HU/D67LTRR910/RSECONDUST')