Search code examples
pythonpandasperformancenetworkxsubgraph

How to attribute to each elements of a sublist sharing comment elements the unique ID of the related sublist?


Out of an list of ~500k lines made of pairs of items, I'm trying to build up a file that aims to allocate for each item an ID related to the group they belong to. Further explanations follow.

And I would need some help to get the result in a smart and efficient way (that is pythonic)

==============

what I want to do

Transform the input file df0 into the desired output df2

For instance, the starting file would look like this (but with 500k entries), where a relation from item1 to item2 is determined by the structure of the dataframe.

df0 : input

df0 = pd.DataFrame({
"item 1": ['Q', 'R', 'B', 'A'],
"item 2": ['R', 'P', 'A', 'C']
})

It reads as following: item Q is related to item R, and item R is related to item P, hence item Q is related to item P (same with with A, B and C). In that case, the transitivity of relationships leads to build two groups of items.

  • Each item belongs to only 1 group.
  • It is expected in the real case file that groups can hold up to 11 items.
  • the alphabetic order plays no role

Thanks to other contributions on stackoverflow, I managed to group all transitive items into single sets and allocate them a single group number (or ID). Meaning I get a dataframe that looks like that:

df1 = pd.DataFrame({
"items": [{'Q', 'R', 'P'}, {'B', 'A', 'C'} ],
"group": [1, 2]
})

The result above is now to be transformed to support further data post-treatments and the desired outcome should look like this:

df2 : desired output

df2 = pd.DataFrame({
"items": ['Q', 'R', 'P', 'B', 'A', 'C' ],
"group": [1, 1, 1, 2, 2, 2 ]
})

==============

What I managed so far

step 1 : convert df1.item into a series of single items

d = df1.item
e = list(sorted(set(chain.from_iterable(d))))
df2 = pd.DataFrame({'item':e})

step 2 : 'vlookup' df2.items back to df1.group via df1.items

df2['group'] = ''  

n = 0
for row in df2.items :
m = 0
for row in df1.items :
    if df2['items'][n] in df1['items'][m]:
        df2['group'][n] = df1['group'][m]
    m = m + 1
n = n + 1

==============

It does work for small tables, but is not workable on large dataframes.

I'm now looking for assistance regarding :

  • either a better code for step 2 to enhance df2 (preferred)
  • or a better way to jump over step 2 and get df2 straight out of df1

Thank you very in advance for your time and feedback !


Solution

  • IIUC, you might try looking at the networkx library.

    You can create an undirect network graph directly from your pandas.DataFrame and use the connected_component_subgraphs method to extract the subgroups:

    import networkx as nx
    
    df0 = pd.DataFrame({'item 1': {0: 'Q', 1: 'R', 2: 'B', 3: 'A'},
                        'item 2': {0: 'R', 1: 'P', 2: 'A', 3: 'C'}})
    
    g = nx.convert_matrix.from_pandas_edgelist(df0, source='item 1', target='item 2')
    

    Use list comprehension to create data for your new DataFrame

    subgroups = [(n, i + 1) for i, sg in enumerate(nx.connected_component_subgraphs(g)) for n in sg.nodes]
    
    df2 = pd.DataFrame(subgroups, columns=['items', 'subgroup'])
    print(df2)
    
      items  subgroup
    0     P         1
    1     R         1
    2     Q         1
    3     C         2
    4     A         2
    5     B         2