Search code examples
pythonpandashierarchy

Drawing hierarchy of items with scores


I am new to Python and programming in general, so would appreciate any help. I am trying to use Python (ideally with the use of Pandas) to do the following:

Data

I have a table that looks like this:

+--------------------+-------+
|    Parent:Child    | Score |
+--------------------+-------+
| Life:Work          |     3 |
| Work:Money         |     2 |
| Work:Hours         |     3 |
| Work:Hours         |     2 |
| Life:Health        |     2 |
| Money:Life savings |     3 |
+--------------------+-------+

Desired Outputs

  1. Table: Determine unique items and calculate average scores:

Scores are averaged where there are multiple entries

+--------------+---------------+
|  Unique item | Average score |
+--------------+---------------+
| Life         | NaN           |
| Work         | 3             |
| Health       | 2             |
| Money        | 2             |
| Hours        | 2.5           |
| Life savings | 3             |
+--------------+---------------+
  1. Tree:

a) Determine hierarchy of the items:

Life > Work > Money > Life savings

Life > Work > Hours

Life > Health

b) draw tree with items and average scores:

                 Life (NaN)
              /              \
      Work (3)               Health (2)
       /       \  
 Money (2)  Hours (2.5)
      | 
Life savings (3)

Some notes:

In the data, colon (":") denotes the relationship between the items. The format is Parent:Child

"Life" has no score, so it should return NaN

"Hours" has two entries in the data, so the average is shown" (2+3)/2 = 2.5

Greatly appreciate your help!

EDITED Thanks AKX for your helpful response. There's just one part which is not solved so I will clarify here. For 2) Tree: a) Determine hierarchy of the items:

The raw data doesn't specify which layer the Parent:Child is at. The issue here is to write code that can figure this out and link them. From "Life:Work" and "Work:Money", we need to figure out that the child of the first entry ("Work") matches the parent of the second entry ("Money"). ie:

From:
Life:Work
Work:Money

Combines into:
Life:Work:Money

Ultimately, from Raw data:

+--------------------+-------+
|    Parent:Child    | Score |
+--------------------+-------+
| Life:Work          |     3 |
| Work:Money         |     2 |
| Work:Hours         |     3 |
| Work:Hours         |     2 |
| Life:Health        |     2 |
| Money:Life savings |     3 |
+--------------------+-------+

Create a table like this:


+--------+--------+--------+--------------+-----------+------------------------------------------------------------------------------------------------------------------------------------+
| Layer1 | Layer2 | Layer3 |    Layer4    | Avg Score |                                                             #Comments                                                              |
+--------+--------+--------+--------------+-----------+------------------------------------------------------------------------------------------------------------------------------------+
| Life   | Work   |        |              | 3         | #Directly from "Life:Work" in raw data                                                                                             |
| Life   | Work   | Money  |              | 2         | #Entry Work:Money has score 2. Since there is an entry "Life:Work", we know "Work" isn't an ultimate parent, and sits below "Life" |
| Life   | Work   | Money  | Life savings | 3         | #Entry "Money:Life savings" has score 3. Similarly, we know from other entries that the hierarchy is Life > Work > Money           |
| Life   | Work   | Hours  |              | 2.5       | #There're entries "Work:Money" and another "Work:Hours", so we know both "Money" and "Hours" are direct children of "Work"         |
| Life   | Health |        |              | 2         | #Directly from "Life:Health" which has score 2. And there is no entry above "Life", which makes it the top of the hierarchy        |
| Life   |        |        |              | NaN       | #There is no entry where "Life" is a child, so "Life" is an ultimate parent. Also, no entry tells us the score for "Life"          |
+--------+--------+--------+--------------+-----------+------------------------------------------------------------------------------------------------------------------------------------+


Then from this table, we should be able to create the tree (format doesn't matter).

                 Life (NaN)
              /              \
      Work (3)               Health (2)
       /       \  
 Money (2)  Hours (2.5)
      | 
Life savings (3)

Again, appreciate any help!


Solution

  • Here's a thing using that asciitree library I mentioned.

    Turns out it made it rather easy to print out a custom value for each tree node, which is exactly what we wanted here.

    I tried to add helpful comments where I could.

    from asciitree import LeftAligned, DictTraversal
    
    import pandas as pd
    from collections import defaultdict
    
    
    class ShowValueTraversal(DictTraversal):
        def __init__(self, values):
            self.values = values
    
        def get_text(self, node):
            key = node[0]
            if key in self.values:
                return f"{key} ({self.values[key]})"
            return key
    
    
    def treeify(averages_dict):
        # Make a recursive tree we can just add children to
        make_tree = lambda: defaultdict(make_tree)
        tree = make_tree()
    
        for tag, value in averages_dict.items():
            parent = tree
            parts = tag.split(":")
            for i in range(len(parts) + 1):
                joined_tag = ":".join(parts[:i])
                parent = parent[joined_tag]
        return tree
    
    
    def fixup_names(dct):
        # Break down the keys on colons
        dct = {tuple(key.split(":")): value for (key, value) in dct.items()}
        # Get a mapping of the last "atoms" of each known name to their full name
        last_atom_map = {p[-1]: p for p in dct}
    
        # Walk through the original dictionary, replacing any known first atom with
        # an entry from the last atom map if possible and reconstitute the keys
        new_dct = {}
        for key, value in dct.items():
            key_parts = list(key)
            while key_parts[0] in last_atom_map:
                # Slice in the new prefix
                key_parts[0:1] = last_atom_map[key_parts[0]]
            new_key = ":".join(key_parts)
            new_dct[new_key] = value
        return new_dct
    
    
    df = pd.DataFrame(
        [
            ("Life:Work", 3),
            ("Work:Money", 2),
            ("Work:Hours", 3),
            ("Work:Hours", 2),
            ("Life:Health", 2),
            ("Money:Life savings", 3),
            ("Money:Something", 2),
            ("Money:Something:Deeper", 1),
        ],
        columns=["tag", "value"],
    )
    
    print("# Original data")
    print(df)
    print()
    
    print("# Averages")
    df_averages = df.groupby("tag").mean()
    print(df_averages)
    print()
    
    # Turn the averages into a dict of tag -> value
    averages_dict = dict(df_averages.itertuples())
    
    # Fix up the names (to infer hierarchy)
    averages_dict = fixup_names(averages_dict)
    
    # Generate a tree out of the flat data
    tree = treeify(averages_dict)
    # Instantiate a custom asciitree traversal object that knows how to
    # look up the values from the dict
    traverse = ShowValueTraversal(values=averages_dict)
    
    # Print it out!
    print("# Tree")
    print(LeftAligned(traverse=traverse)(tree))
    
    

    The output is

    # Original data
                          tag  value
    0               Life:Work      3
    1              Work:Money      2
    2              Work:Hours      3
    3              Work:Hours      2
    4             Life:Health      2
    5      Money:Life savings      3
    6         Money:Something      2
    7  Money:Something:Deeper      1
    
    # Averages
                            value
    tag
    Life:Health               2.0
    Life:Work                 3.0
    Money:Life savings        3.0
    Money:Something           2.0
    Money:Something:Deeper    1.0
    Work:Hours                2.5
    Work:Money                2.0
    
    # Tree
    
     +-- Life
         +-- Life:Health (2.0)
         +-- Life:Work (3.0)
             +-- Life:Work:Money (2.0)
             |   +-- Life:Work:Money:Life savings (3.0)
             |   +-- Life:Work:Money:Something (2.0)
             |       +-- Life:Work:Money:Something:Deeper (1.0)