Search code examples
pythonpandasnetworkxgraph-theoryhierarchy

Speed up Networkx performance to create management hierarchy


I have a dataset as below:

  Emp Mgr
0  E1  M1
1  M1  M2
2  M3  M5
3  M2  M5

So for each User(row) I need management hierarchy as:

  Emp Mgr Level_01 Level_02 Level_03 Level_04
0  E1  M1       M5       M2       M1       E1
1  M1  M2       M5       M2       M1
2  M3  M5       M5       M3
3  M2  M5       M5       M2

The output is something like:
Emp > Managers (top level to his direct manager).
So eg: For EmpA: Mgr1 (CEO) - Mgr2 (Director) - M3 (Sr.Manager) - M4 (Emp's direct manager)

I am using networks for it as described in this answer. There are 177K records with 2 root nodes. The total time to generate this hierarchy is more than 6hours. How can I significantly reduce the time taken by the script.

G = nx.from_pandas_edgelist(df, source='Mgr', target='Emp',
                            create_using=nx.DiGraph)

# find roots (= top managers)
roots = [n for n,d in G.in_degree() if d==0]
    
df2 = (pd.DataFrame([next((p for root in roots for p in nx.all_simple_paths(G, root, node)), [])[:-1]
                     for node in df['Emp']], index=df.index)
         .rename(columns=lambda x: f'Level_{x+1:02d}')
      )

Solution

  • I rewrote the logic using a combination of networkx and pure python with a recursive function:

    df = pd.read_csv('demodata2.csv', skiprows=3, usecols=[0, 1])
    
    G = nx.from_pandas_edgelist(df.dropna(subset='MgrUPN'), source='MgrUPN', target='EmpUPN',
                                create_using=nx.DiGraph)
    G.remove_edges_from(nx.selfloop_edges(G))
    
    parent = {}
    
    # uncomment the prints to see which nodes have no or multiple parents
    for n in nx.dfs_postorder_nodes(G):
        p = list(G.predecessors(n))
        if len(p) == 0:
            #print(f'"{n}" has no parent')
            pass
        else:
            if len(p)>1:
                #print(f'"{n}" has multiple parents ({p}), picking "{p[0]}"')
                pass
            parent[n] = p[0]
    
    def get_parents(n):
        try:
            yield from get_parents(parent[n])
        except KeyError:
            pass
        yield n
    
    out = df.join(pd.DataFrame([list(get_parents(node)) for node in df['EmpUPN']], index=df.index)
                    .rename(columns=lambda x: f'Level_{x+1:02d}')
                 )
    print(out)
    

    Output:

                                                      EmpUPN                         MgrUPN                                        Level_01                 Level_02                   Level_03                     Level_04                    Level_05                       Level_06                 Level_07                         Level_08                                          Level_09 Level_10 Level_11
    0                                    2pjwtpk.zp@lB0f.ctm     Mlpkuz.Lpimwpubpp@lB0f.ctm                                 mlpji2@lB0f.ctm            pll2@lB0f.ctm       llul.zlvill@lB0f.ctm      l22llttp2i.p.K@lB0f.ctm    l2jt2it.jtpjtl1@lB0f.ctm     Mlpkuz.Lpimwpubpp@lB0f.ctm      2pjwtpk.zp@lB0f.ctm                             None                                              None     None     None
    1                               Iz2pd_ibm.ctm242@DpF.ctm         zlvipl.Ku2wlp@lB0f.ctm                                 mlpji2@lB0f.ctm            pll2@lB0f.ctm    Flp5lz.j5tbl2i@lB0f.ctm     Mlximp.Dpzbip2z@lB0f.ctm        Blljij.will@lB0f.ctm        Lpp.Klmlppjztz@lB0f.ctm   zlvipl.Ku2wlp@lB0f.ctm         Iz2pd_ibm.ctm242@DpF.ctm                                              None     None     None
    2                                  2Z.zzw.tj5pp@lB0f.ctm      jpppp20fp.lpipiz@lB0f.ctm                          0flj52.5pl2p2@lB0f.ctm  wlp2.zi20fllip@lB0f.ctm  jpppp20fp.lpipiz@lB0f.ctm        2Z.zzw.tj5pp@lB0f.ctm                        None                           None                     None                             None                                              None     None     None
    3                                      IlM01w23@lB0f.ctm  Fpl2jizpk.ll20ft0f5l@lB0f.ctm                                 mlpji2@lB0f.ctm            pll2@lB0f.ctm      Kpiz.Ltvpjt2@lB0f.ctm          pt2.ljki2z@lB0f.ctm      Luklz.tbtpzk2@lB0f.ctm  Fpl2jizpk.ll20ft0f5l@lB0f.ctm        IlM01w23@lB0f.ctm                             None                                              None     None     None
    4                                      zbp2plux@lB0f.ctm          Ml2ppz5.wtpl@lB0f.ctm                                 mlpji2@lB0f.ctm            pll2@lB0f.ctm    Xppxpz.0fttlpp@lB0f.ctm      Lipzbpj.D5tkpp@lB0f.ctm  0f5pizjil2.Kpu2p2@lB0f.ctm          Ml2ppz5.wtpl@lB0f.ctm        zbp2plux@lB0f.ctm                             None                                              None     None     None
    ...                                                  ...                            ...                                             ...                      ...                        ...                          ...                         ...                            ...                      ...                              ...                                               ...      ...      ...
    177920  5idplki_zlijtu_lm.zumijtmtlifp.0ft.jl242@DpF.ctm   2tz5i5ijt.Mijzuzumi@lB0f.ctm                                 mlpji2@lB0f.ctm            pll2@lB0f.ctm    jlklz5i.Upzlkl@lB0f.ctm  zpii0f5ipt5.Kimtjt@lB0f.ctm      Ji2.2tz5iklwl@lB0f.ctm      2tz5ijlkl.2lkl2l@lB0f.ctm  jlkl5ipt.2iz5i@lB0f.ctm     2tz5i5ijt.Mijzuzumi@lB0f.ctm  5idplki_zlijtu_lm.zumijtmtlifp.0ft.jl242@DpF.ctm     None     None
    177921                    zlijtb_p2tmt.0ft.jl242@DpF.ctm                            NaN                  zlijtb_p2tmt.0ft.jl242@DpF.ctm                     None                       None                         None                        None                           None                     None                             None                                              None     None     None
    177922    zlijt.kti0f5i.553_0fl2t2-ijz.0ft.jl242@DpF.ctm                            NaN  zlijt.kti0f5i.553_0fl2t2-ijz.0ft.jl242@DpF.ctm                     None                       None                         None                        None                           None                     None                             None                                              None     None     None
    177923                   zlijtu_2dk2pj.0ft.jl242@DpF.ctm        0f5ijlkl.Kiz5i@lB0f.ctm                                 mlpji2@lB0f.ctm            pll2@lB0f.ctm    jlklz5i.Upzlkl@lB0f.ctm     Mlzl5lpu.Klwlzp@lB0f.ctm    p2ujl.jlkl5lz5i@lB0f.ctm          2ukit.jl2lkl@lB0f.ctm  0f5ijlkl.Kiz5i@lB0f.ctm  zlijtu_2dk2pj.0ft.jl242@DpF.ctm                                              None     None     None
    177924                 pkljltkl_zj2pj.0ft.jl242@lB0f.ctm                            NaN               pkljltkl_zj2pj.0ft.jl242@lB0f.ctm                     None                       None                         None                        None                           None                     None                             None                                              None     None     None
    
    [177925 rows x 13 columns]
    

    Running time for 177K rows:

    1.44 s ± 255 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)