Search code examples
pythonpandasdataframeautomationdata-science

How to count people who are below a position


I'm looking to count how many people are below a given user of the data frame.

Employee Manager
A -
B A
C A
D A
E A
F B
G B
H C
I C

I would like to get in the output: I, H, G, F, E and D have no employees below C has two employees (H and I) below it B has two employees (F and G) A has eight employees below him (B, C, D and E plus the employees of B and C)

Would anyone have any suggestions? In my DF I have more hierarchy layers and a very large amount of data.

I thought about storing it in a dictionary and doing a loop to update it, but I believe that this solution is not efficient at all. I would like to know if there is any more efficient technique to solve this type of problem.


Solution

  • I would use a directed graph with networkx. This is a super fun python package.

    import networkx as nx, pandas as pd
    
    #set up data
    employee = ['A', 'B', 'C','D','E','F','G','H','I']
    manager = ['', 'A', 'A','A','A','B','B','C','C']
    relations = pd.DataFrame(list(zip(employee,manager)), columns = ['Employee', 'Manager'])
    
    # If there is no manager, make it the employee
    relations.Manager = np.where(relations.Manager == '', relations.Employee, relations.Manager)
    # or might need depending on data format:
    relations.Manager = np.where(relations.Manager.isna(), relations.Employee, relations.Manager)
    
    # Create tuples for 'edges'
    relations['edge'] = list(zip(relations.Manager, relations.Employee))
    
    # Create graph
    G = nx.DiGraph()
    G.add_nodes_from(list(relations.Employee))
    G.add_edges_from(list(set(relations.edge)))
    
    #Find all the descendants of nodes/employees
    relations['employees_below'] = relations.apply(lambda row: nx.descendants(G,row.Employee), axis = 1)
    

    returns:

      Employee Manager    edge           employees_below
    0        A       A  (A, A)  {C, G, I, D, H, F, E, B}
    1        B       A  (A, B)                    {F, G}
    2        C       A  (A, C)                    {H, I}
    3        D       A  (A, D)                        {}
    4        E       A  (A, E)                        {}
    5        F       B  (B, F)                        {}
    6        G       B  (B, G)                        {}
    7        H       C  (C, H)                        {}
    8        I       C  (C, I)                        {}
    

    The way it works: graphs are nodes and edges. In this case, your nodes are employees and your edges are a relationship between a manager and an employee. Do a quick google for 'networkx directed graph' images and you'll get the idea of what this looks like in an image representation.

    • Make sure your data is cleaned up where everyone has a manager (make it themselves if there is none, for example)
    • First, create your edges in the form of a tuple of (manager, employee) and save it somewhere (I chose to make it a column in the df called edges).
    • Next, make a directed graph in networkx. A directed graph is needed due to the hierarchical relationship. this means that relationships work down from manager to employee. So, in this case, each edge goes in a direction from manager to employee.
    • Add every employee to your graph as a 'node'.
    • Add every employee-manager relationship to your graph as an edge, using the pre-defined tuples of (manager, employee) discussed previously.
    • Lastly, you can get the output of an employee's subordinates by finding all this node's descendants. Descendants are all nodes (ie, employees) that can be reached from a node (ie, employee). I chose to assign this to a column and apply the descendants function to the employee in each row with apply.