I have a pandas DataFrame Network with a network structure like this:
{'Sup': {0: 1002000157,
1: 1002000157,
2: 1002000157,
3: 1002000157,
4: 1002000157,
5: 1002000157,
6: 1002000157,
7: 1002000157,
8: 1002000157,
9: 1002000157,
10: 1002000157,
11: 1002000157,
12: 1002000157,
13: 1002000382,
14: 1002000382,
15: 1002000382,
16: 1002000382,
17: 1002000382,
18: 1002000382,
19: 1002000382,
20: 1002000382,
21: 1002000382,
22: 1002000382,
23: 1002000382,
24: 1002000382,
25: 1002000382,
26: 1002000382,
27: 1002000382,
28: 1002000382,
29: 1002000382},
'Cust': {0: 1002438313,
1: 8039296054,
2: 9003188096,
3: 14900070991,
4: 17005234747,
5: 18006860724,
6: 28000286091,
7: 29009623382,
8: 39000007702,
9: 39004420023,
10: 46000088397,
11: 50000063751,
12: 7000090017,
13: 1900120936,
14: 1900779883,
15: 2000013994,
16: 2001222824,
17: 2003032125,
18: 2900121723,
19: 2900197555,
20: 2902742641,
21: 3000101113,
22: 3000195031,
23: 3000318054,
24: 3900091301,
25: 3911084436,
26: 4900112325,
27: 5900720933,
28: 7000001703,
29: 8000004881}}
I would like to reproduce this R command in Python (ideally without the kernel crashing):
NodesSharingSupplier <- inner_join(Network, Network, by=c('Sup'='Sup'))
This is an SQL-style inner join, so I fear it cannot be performed simply with an inner merge on Sup in Python.
How do I reproduce it in Python?
IIUC, you are looking for merge:
NodesSharingSupplier = Network.merge(Network, on='Sup', how='inner')
print(NodesSharingSupplier)
# Output
Sup Cust_x Cust_y
0 1002000157 1002438313 1002438313
1 1002000157 1002438313 8039296054
2 1002000157 1002438313 9003188096
3 1002000157 1002438313 14900070991
4 1002000157 1002438313 17005234747
.. ... ... ...
453 1002000382 8000004881 3911084436
454 1002000382 8000004881 4900112325
455 1002000382 8000004881 5900720933
456 1002000382 8000004881 7000001703
457 1002000382 8000004881 8000004881
[458 rows x 3 columns]
You can remove cases where Cust_x == Cust_y by appending .query('Cust_x != Cust_y') after .merge(...).
Input:
data = {'Sup': {0: 1002000157, 1: 1002000157, 2: 1002000157, 3: 1002000157, 4: 1002000157, 5: 1002000157, 6: 1002000157, 7: 1002000157, 8: 1002000157, 9: 1002000157, 10: 1002000157, 11: 1002000157, 12: 1002000157, 13: 1002000382, 14: 1002000382, 15: 1002000382, 16: 1002000382, 17: 1002000382, 18: 1002000382, 19: 1002000382, 20: 1002000382, 21: 1002000382, 22: 1002000382, 23: 1002000382, 24: 1002000382, 25: 1002000382, 26: 1002000382, 27: 1002000382, 28: 1002000382, 29: 1002000382},
'Cust': {0: 1002438313, 1: 8039296054, 2: 9003188096, 3: 14900070991, 4: 17005234747, 5: 18006860724, 6: 28000286091, 7: 29009623382, 8: 39000007702, 9: 39004420023, 10: 46000088397, 11: 50000063751, 12: 7000090017, 13: 1900120936, 14: 1900779883, 15: 2000013994, 16: 2001222824, 17: 2003032125, 18: 2900121723, 19: 2900197555, 20: 2902742641, 21: 3000101113, 22: 3000195031, 23: 3000318054, 24: 3900091301, 25: 3911084436, 26: 4900112325, 27: 5900720933, 28: 7000001703, 29: 8000004881}}
Network = pd.DataFrame(data)
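Putting the merge and the self-pair filter together, here is a minimal runnable sketch on a small subset of the data above (three rows, two suppliers) so the result is easy to read:

```python
import pandas as pd

# Small sample: two customers share supplier 1002000157,
# supplier 1002000382 has only one customer
Network = pd.DataFrame({
    'Sup':  [1002000157, 1002000157, 1002000382],
    'Cust': [1002438313, 8039296054, 1900120936],
})

# Self-join on 'Sup', then drop rows that pair a customer with itself
NodesSharingSupplier = (
    Network.merge(Network, on='Sup', how='inner')
           .query('Cust_x != Cust_y')
)
print(NodesSharingSupplier)
```

Only the two customers of supplier 1002000157 survive the filter (once in each order); the lone customer of 1002000382 only ever pairs with itself and is dropped.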
More information: Pandas Merging 101
Update
The kernel dies after it. Consider that the Network database has 5 million observations
Use dask:
import dask.dataframe as dd
NetworkDD = dd.from_pandas(Network, npartitions=Network['Sup'].nunique())
NodesSharingSupplier = dd.merge(NetworkDD, NetworkDD, on='Sup').compute()
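Before committing to the full join, it can also help to estimate the output size: a self-join on Sup produces the sum of the squared group sizes, which grows quadratically with the number of rows per supplier and explains why 5 million observations can exhaust memory. A quick sketch of that estimate in plain pandas, using the group sizes from the sample above (13 and 17 rows):

```python
import pandas as pd

# Group sizes matching the sample data: 13 rows for one supplier, 17 for the other
Network = pd.DataFrame({'Sup': [1002000157] * 13 + [1002000382] * 17})

# Each supplier with n rows contributes n**2 rows to the self-join,
# so the estimate is the sum of squared group sizes
sizes = Network['Sup'].value_counts()
estimated_rows = int((sizes ** 2).sum())
print(estimated_rows)  # 13**2 + 17**2 = 458, matching the merge output above
```

If the estimate is too large to fit in memory, that is a sign to use dask (or to aggregate per supplier instead of materializing every pair).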