
Perform R's inner_join() in Python


I have a pandas DataFrame Network with a network structure like this:

{'Sup': {0: 1002000157,
  1: 1002000157,
  2: 1002000157,
  3: 1002000157,
  4: 1002000157,
  5: 1002000157,
  6: 1002000157,
  7: 1002000157,
  8: 1002000157,
  9: 1002000157,
  10: 1002000157,
  11: 1002000157,
  12: 1002000157,
  13: 1002000382,
  14: 1002000382,
  15: 1002000382,
  16: 1002000382,
  17: 1002000382,
  18: 1002000382,
  19: 1002000382,
  20: 1002000382,
  21: 1002000382,
  22: 1002000382,
  23: 1002000382,
  24: 1002000382,
  25: 1002000382,
  26: 1002000382,
  27: 1002000382,
  28: 1002000382,
  29: 1002000382},
 'Cust': {0: 1002438313,
  1: 8039296054,
  2: 9003188096,
  3: 14900070991,
  4: 17005234747,
  5: 18006860724,
  6: 28000286091,
  7: 29009623382,
  8: 39000007702,
  9: 39004420023,
  10: 46000088397,
  11: 50000063751,
  12: 7000090017,
  13: 1900120936,
  14: 1900779883,
  15: 2000013994,
  16: 2001222824,
  17: 2003032125,
  18: 2900121723,
  19: 2900197555,
  20: 2902742641,
  21: 3000101113,
  22: 3000195031,
  23: 3000318054,
  24: 3900091301,
  25: 3911084436,
  26: 4900112325,
  27: 5900720933,
  28: 7000001703,
  29: 8000004881}}

I would like to reproduce this R command in Python (if possible, without the kernel crashing):

NodesSharingSupplier <- inner_join(Network, Network,  by=c('Sup'='Sup'))

This is an SQL-style inner join of the table with itself, so I fear it cannot be performed with a simple inner merge on Sup in Python.

How do I reproduce it in Python?


Solution

  • IIUC, you are looking for merge:

    NodesSharingSupplier = Network.merge(Network, on='Sup', how='inner')
    print(NodesSharingSupplier)
    
    # Output
                Sup      Cust_x       Cust_y
    0    1002000157  1002438313   1002438313
    1    1002000157  1002438313   8039296054
    2    1002000157  1002438313   9003188096
    3    1002000157  1002438313  14900070991
    4    1002000157  1002438313  17005234747
    ..          ...         ...          ...
    453  1002000382  8000004881   3911084436
    454  1002000382  8000004881   4900112325
    455  1002000382  8000004881   5900720933
    456  1002000382  8000004881   7000001703
    457  1002000382  8000004881   8000004881
    
    [458 rows x 3 columns]
    

    You can remove the rows where Cust_x == Cust_y by appending .query('Cust_x != Cust_y') after .merge(...), as shown below.
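
    For instance, a minimal sketch chaining the merge above with that filter:

    NodesSharingSupplier = (
        Network.merge(Network, on='Sup', how='inner')
               .query('Cust_x != Cust_y')  # drop self-pairs
    )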

    Input:

    data = {'Sup': {0: 1002000157, 1: 1002000157, 2: 1002000157, 3: 1002000157, 4: 1002000157, 5: 1002000157, 6: 1002000157, 7: 1002000157, 8: 1002000157, 9: 1002000157, 10: 1002000157, 11: 1002000157, 12: 1002000157, 13: 1002000382, 14: 1002000382, 15: 1002000382, 16: 1002000382, 17: 1002000382, 18: 1002000382, 19: 1002000382, 20: 1002000382, 21: 1002000382, 22: 1002000382, 23: 1002000382, 24: 1002000382, 25: 1002000382, 26: 1002000382, 27: 1002000382, 28: 1002000382, 29: 1002000382},
            'Cust': {0: 1002438313, 1: 8039296054, 2: 9003188096, 3: 14900070991, 4: 17005234747, 5: 18006860724, 6: 28000286091, 7: 29009623382, 8: 39000007702, 9: 39004420023, 10: 46000088397, 11: 50000063751, 12: 7000090017, 13: 1900120936, 14: 1900779883, 15: 2000013994, 16: 2001222824, 17: 2003032125, 18: 2900121723, 19: 2900197555, 20: 2902742641, 21: 3000101113, 22: 3000195031, 23: 3000318054, 24: 3900091301, 25: 3911084436, 26: 4900112325, 27: 5900720933, 28: 7000001703, 29: 8000004881}}
    Network = pd.DataFrame(data)
    

    More information: Pandas Merging 101

    Update

    "The kernel dies after it. Consider that the Network database has 5 million observations."

    Use dask:

    import dask.dataframe as dd

    # Partition the DataFrame so the join runs chunk by chunk
    NetworkDD = dd.from_pandas(Network, npartitions=Network['Sup'].nunique())
    # Lazy merge; .compute() brings the result back as a pandas DataFrame
    NodesSharingSupplier = dd.merge(NetworkDD, NetworkDD, on='Sup').compute()
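
    If the joined table itself is too large to hold in memory, a variation (a sketch, assuming you also want to drop the Cust_x == Cust_y self-pairs as above) is to filter lazily before calling .compute(), so dask can apply the filter partition by partition instead of materialising the full unfiltered join:

    # Sketch: build the merge lazily, filter, then materialise only the result
    joined = dd.merge(NetworkDD, NetworkDD, on='Sup')
    NodesSharingSupplier = joined.query('Cust_x != Cust_y').compute()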