I have a dataset as below:
Emp Mgr
0 E1 M1
1 M1 M2
2 M3 M5
3 M2 M5
So for each User(row) I need management hierarchy as:
Emp Mgr Level_01 Level_02 Level_03 Level_04
0 E1 M1 M5 M2 M1 E1
1 M1 M2 M5 M2 M1
2 M3 M5 M5 M3
3 M2 M5 M5 M2
The output is something like:
Emp > Managers (top level to his direct manager).
So eg: For EmpA: Mgr1 (CEO) - Mgr2 (Director) - M3 (Sr.Manager) - M4 (Emp's direct manager)
I am using networks for it as described in this answer. There are 177K records with 2 root nodes. The total time to generate this hierarchy is more than 6hours. How can I significantly reduce the time taken by the script.
G = nx.from_pandas_edgelist(df, source='Mgr', target='Emp',
create_using=nx.DiGraph)
# find roots (= top managers)
roots = [n for n,d in G.in_degree() if d==0]
df2 = (pd.DataFrame([next((p for root in roots for p in nx.all_simple_paths(G, root, node)), [])[:-1]
for node in df['Emp']], index=df.index)
.rename(columns=lambda x: f'Level_{x+1:02d}')
)
I rewrote the logic using a combination of networkx
and pure python with a recursive function:
df = pd.read_csv('demodata2.csv', skiprows=3, usecols=[0, 1])
G = nx.from_pandas_edgelist(df.dropna(subset='MgrUPN'), source='MgrUPN', target='EmpUPN',
create_using=nx.DiGraph)
G.remove_edges_from(nx.selfloop_edges(G))
parent = {}
# uncomment the prints to see which nodes have no or multiple parents
for n in nx.dfs_postorder_nodes(G):
p = list(G.predecessors(n))
if len(p) == 0:
#print(f'"{n}" has no parent')
pass
else:
if len(p)>1:
#print(f'"{n}" has multiple parents ({p}), picking "{p[0]}"')
pass
parent[n] = p[0]
def get_parents(n):
try:
yield from get_parents(parent[n])
except KeyError:
pass
yield n
out = df.join(pd.DataFrame([list(get_parents(node)) for node in df['EmpUPN']], index=df.index)
.rename(columns=lambda x: f'Level_{x+1:02d}')
)
print(out)
Output:
EmpUPN MgrUPN Level_01 Level_02 Level_03 Level_04 Level_05 Level_06 Level_07 Level_08 Level_09 Level_10 Level_11
0 2pjwtpk.zp@lB0f.ctm Mlpkuz.Lpimwpubpp@lB0f.ctm mlpji2@lB0f.ctm pll2@lB0f.ctm llul.zlvill@lB0f.ctm l22llttp2i.p.K@lB0f.ctm l2jt2it.jtpjtl1@lB0f.ctm Mlpkuz.Lpimwpubpp@lB0f.ctm 2pjwtpk.zp@lB0f.ctm None None None None
1 Iz2pd_ibm.ctm242@DpF.ctm zlvipl.Ku2wlp@lB0f.ctm mlpji2@lB0f.ctm pll2@lB0f.ctm Flp5lz.j5tbl2i@lB0f.ctm Mlximp.Dpzbip2z@lB0f.ctm Blljij.will@lB0f.ctm Lpp.Klmlppjztz@lB0f.ctm zlvipl.Ku2wlp@lB0f.ctm Iz2pd_ibm.ctm242@DpF.ctm None None None
2 2Z.zzw.tj5pp@lB0f.ctm jpppp20fp.lpipiz@lB0f.ctm 0flj52.5pl2p2@lB0f.ctm wlp2.zi20fllip@lB0f.ctm jpppp20fp.lpipiz@lB0f.ctm 2Z.zzw.tj5pp@lB0f.ctm None None None None None None None
3 IlM01w23@lB0f.ctm Fpl2jizpk.ll20ft0f5l@lB0f.ctm mlpji2@lB0f.ctm pll2@lB0f.ctm Kpiz.Ltvpjt2@lB0f.ctm pt2.ljki2z@lB0f.ctm Luklz.tbtpzk2@lB0f.ctm Fpl2jizpk.ll20ft0f5l@lB0f.ctm IlM01w23@lB0f.ctm None None None None
4 zbp2plux@lB0f.ctm Ml2ppz5.wtpl@lB0f.ctm mlpji2@lB0f.ctm pll2@lB0f.ctm Xppxpz.0fttlpp@lB0f.ctm Lipzbpj.D5tkpp@lB0f.ctm 0f5pizjil2.Kpu2p2@lB0f.ctm Ml2ppz5.wtpl@lB0f.ctm zbp2plux@lB0f.ctm None None None None
... ... ... ... ... ... ... ... ... ... ... ... ... ...
177920 5idplki_zlijtu_lm.zumijtmtlifp.0ft.jl242@DpF.ctm 2tz5i5ijt.Mijzuzumi@lB0f.ctm mlpji2@lB0f.ctm pll2@lB0f.ctm jlklz5i.Upzlkl@lB0f.ctm zpii0f5ipt5.Kimtjt@lB0f.ctm Ji2.2tz5iklwl@lB0f.ctm 2tz5ijlkl.2lkl2l@lB0f.ctm jlkl5ipt.2iz5i@lB0f.ctm 2tz5i5ijt.Mijzuzumi@lB0f.ctm 5idplki_zlijtu_lm.zumijtmtlifp.0ft.jl242@DpF.ctm None None
177921 zlijtb_p2tmt.0ft.jl242@DpF.ctm NaN zlijtb_p2tmt.0ft.jl242@DpF.ctm None None None None None None None None None None
177922 zlijt.kti0f5i.553_0fl2t2-ijz.0ft.jl242@DpF.ctm NaN zlijt.kti0f5i.553_0fl2t2-ijz.0ft.jl242@DpF.ctm None None None None None None None None None None
177923 zlijtu_2dk2pj.0ft.jl242@DpF.ctm 0f5ijlkl.Kiz5i@lB0f.ctm mlpji2@lB0f.ctm pll2@lB0f.ctm jlklz5i.Upzlkl@lB0f.ctm Mlzl5lpu.Klwlzp@lB0f.ctm p2ujl.jlkl5lz5i@lB0f.ctm 2ukit.jl2lkl@lB0f.ctm 0f5ijlkl.Kiz5i@lB0f.ctm zlijtu_2dk2pj.0ft.jl242@DpF.ctm None None None
177924 pkljltkl_zj2pj.0ft.jl242@lB0f.ctm NaN pkljltkl_zj2pj.0ft.jl242@lB0f.ctm None None None None None None None None None None
[177925 rows x 13 columns]
Running time for 177K rows:
1.44 s ± 255 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)