python pandas dataframe automation data-science

How to count people who are below a position

I'm looking to count how many people are below a given user of the data frame.

Employee	Manager
A	-
B	A
C	A
D	A
E	A
F	B
G	B
H	C
I	C

I would like to get in the output: I, H, G, F, E and D have no employees below C has two employees (H and I) below it B has two employees (F and G) A has eight employees below him (B, C, D and E plus the employees of B and C)

Would anyone have any suggestions? In my DF I have more hierarchy layers and a very large amount of data.

I thought about storing it in a dictionary and doing a loop to update it, but I believe that this solution is not efficient at all. I would like to know if there is any more efficient technique to solve this type of problem.

Solution

I would use a directed graph with networkx. This is a super fun python package.

import networkx as nx, pandas as pd

#set up data
employee = ['A', 'B', 'C','D','E','F','G','H','I']
manager = ['', 'A', 'A','A','A','B','B','C','C']
relations = pd.DataFrame(list(zip(employee,manager)), columns = ['Employee', 'Manager'])

# If there is no manager, make it the employee
relations.Manager = np.where(relations.Manager == '', relations.Employee, relations.Manager)
# or might need depending on data format:
relations.Manager = np.where(relations.Manager.isna(), relations.Employee, relations.Manager)

# Create tuples for 'edges'
relations['edge'] = list(zip(relations.Manager, relations.Employee))

# Create graph
G = nx.DiGraph()
G.add_nodes_from(list(relations.Employee))
G.add_edges_from(list(set(relations.edge)))

#Find all the descendants of nodes/employees
relations['employees_below'] = relations.apply(lambda row: nx.descendants(G,row.Employee), axis = 1)

returns:

  Employee Manager    edge           employees_below
0        A       A  (A, A)  {C, G, I, D, H, F, E, B}
1        B       A  (A, B)                    {F, G}
2        C       A  (A, C)                    {H, I}
3        D       A  (A, D)                        {}
4        E       A  (A, E)                        {}
5        F       B  (B, F)                        {}
6        G       B  (B, G)                        {}
7        H       C  (C, H)                        {}
8        I       C  (C, I)                        {}

The way it works: graphs are nodes and edges. In this case, your nodes are employees and your edges are a relationship between a manager and an employee. Do a quick google for 'networkx directed graph' images and you'll get the idea of what this looks like in an image representation.

Make sure your data is cleaned up where everyone has a manager (make it themselves if there is none, for example)
First, create your edges in the form of a tuple of (manager, employee) and save it somewhere (I chose to make it a column in the df called edges).
Next, make a directed graph in networkx. A directed graph is needed due to the hierarchical relationship. this means that relationships work down from manager to employee. So, in this case, each edge goes in a direction from manager to employee.
Add every employee to your graph as a 'node'.
Add every employee-manager relationship to your graph as an edge, using the pre-defined tuples of (manager, employee) discussed previously.
Lastly, you can get the output of an employee's subordinates by finding all this node's descendants. Descendants are all nodes (ie, employees) that can be reached from a node (ie, employee). I chose to assign this to a column and apply the descendants function to the employee in each row with apply.