Out of an list of ~500k lines made of pairs of items, I'm trying to build up a file that aims to allocate for each item an ID related to the group they belong to. Further explanations follow.
And I would need some help to get the result in a smart and efficient way (that is pythonic)
==============
Transform the input file df0 into the desired output df2
For instance, the starting file would look like this (but with 500k entries), where a relation from item1 to item2 is determined by the structure of the dataframe.
df0 : input
df0 = pd.DataFrame({
"item 1": ['Q', 'R', 'B', 'A'],
"item 2": ['R', 'P', 'A', 'C']
})
It reads as following: item Q is related to item R, and item R is related to item P, hence item Q is related to item P (same with with A, B and C). In that case, the transitivity of relationships leads to build two groups of items.
Thanks to other contributions on stackoverflow, I managed to group all transitive items into single sets and allocate them a single group number (or ID). Meaning I get a dataframe that looks like that:
df1 = pd.DataFrame({
"items": [{'Q', 'R', 'P'}, {'B', 'A', 'C'} ],
"group": [1, 2]
})
The result above is now to be transformed to support further data post-treatments and the desired outcome should look like this:
df2 : desired output
df2 = pd.DataFrame({
"items": ['Q', 'R', 'P', 'B', 'A', 'C' ],
"group": [1, 1, 1, 2, 2, 2 ]
})
==============
step 1 : convert df1.item into a series of single items
d = df1.item
e = list(sorted(set(chain.from_iterable(d))))
df2 = pd.DataFrame({'item':e})
step 2 : 'vlookup' df2.items back to df1.group via df1.items
df2['group'] = ''
n = 0
for row in df2.items :
m = 0
for row in df1.items :
if df2['items'][n] in df1['items'][m]:
df2['group'][n] = df1['group'][m]
m = m + 1
n = n + 1
==============
It does work for small tables, but is not workable on large dataframes.
I'm now looking for assistance regarding :
Thank you very in advance for your time and feedback !
IIUC, you might try looking at the networkx
library.
You can create an undirect network graph directly from your pandas.DataFrame
and use the connected_component_subgraphs
method to extract the subgroups:
import networkx as nx
df0 = pd.DataFrame({'item 1': {0: 'Q', 1: 'R', 2: 'B', 3: 'A'},
'item 2': {0: 'R', 1: 'P', 2: 'A', 3: 'C'}})
g = nx.convert_matrix.from_pandas_edgelist(df0, source='item 1', target='item 2')
Use list comprehension to create data for your new DataFrame
subgroups = [(n, i + 1) for i, sg in enumerate(nx.connected_component_subgraphs(g)) for n in sg.nodes]
df2 = pd.DataFrame(subgroups, columns=['items', 'subgroup'])
print(df2)
items subgroup
0 P 1
1 R 1
2 Q 1
3 C 2
4 A 2
5 B 2