Group dataframe rows with common values

I have the following dataset:

    0   1   2   3
0   a   ❤   💛  👍
1   b   ❤   👍  🙏
2   c   😉  🙏  👍
3   d   😉  ✨   💪
4   e   ❤   😉  🙏

I would like to perform clustering to group the ROWS which have something in common.

By using networkx in the following code, this is the result:

import networkx as nx
import matplotlib.pyplot as plt

G=nx.from_pandas_edgelist(df, 0, 1)
nx.draw(G, with_labels=True)

Output: [image: groups obtained with networkx]

How can I also take columns 2 and 3 into account? Can I do it without giving priority to any particular column (for example, I want column 2 to be just as important as column 1)?


  • Similarly to this answer, you could treat each dataframe row as a path and look for the connected components. I've added a row that shares no values with any other row to better illustrate how this works:

       0  1   2    3
    0  a  ❤  💛  👍
    1  b  ❤  👍  🙏
    2  c  😉  🙏  👍
    3  d  😉  ✨  💪
    4  e  ❤  😉  🙏
    5  f  👅  😱  🤑

    So iterate over the dataframe rows, and add them as paths with nx.add_path:

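    G = nx.Graph()  # start from an empty graph
    my_list = df.values.tolist()
    for path in my_list:
        # each row becomes a path linking its consecutive values
        nx.add_path(G, path)
    components = list(nx.connected_components(G))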
    [{'a', 'b', 'c', 'd', 'e', '✨', '❤', '👍', '💛', '💪', '😉', '🙏'},
     {'f', '👅', '😱', '🤑'}]

    Now you can traverse the components and, for each one, collect the rows that are subsets of it into a sublist of a nested list:

    groups = []
    for component in components:
        group = []
        for path in my_list:
            if component.issuperset(path):
                # every value of this row belongs to the component
                group.append(path)
        groups.append(group)

    In this case all rows except the last are grouped together, and the last row ends up in a group of its own:

    [[['a', '❤', '💛', '👍'],
      ['b', '❤', '👍', '🙏'],
      ['c', '😉', '🙏', '👍'],
      ['d', '😉', '✨', '💪'],
      ['e', '❤', '😉', '🙏']],
     [['f', '👅', '😱', '🤑']]]
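
The grouping above amounts to finding connected components in which rows that share any value, directly or through other rows, end up together. As a rough, dependency-free sketch of the same idea (a small union-find, not the networkx approach used in the answer), the following assigns one group label per row. The names `group_rows`, `find` and `union` are hypothetical helpers introduced here for illustration, and the sketch assumes every row is a non-empty list of hashable values.

```python
def group_rows(rows):
    """Return one integer label per row.

    Rows that share a value, directly or transitively, get the same label.
    Hypothetical helper for illustration; assumes non-empty rows.
    """
    parent = {}

    def find(x):
        # Register unseen values as their own root, then walk up to the
        # root while shortening the chain (path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        # Merge the sets containing a and b.
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Link every value in a row to that row's first value, so no column
    # is treated as more important than another.
    for row in rows:
        for value in row[1:]:
            union(row[0], value)

    # Number the components in order of first appearance.
    labels = {}
    return [labels.setdefault(find(row[0]), len(labels)) for row in rows]


rows = [
    ["a", "❤", "💛", "👍"],
    ["b", "❤", "👍", "🙏"],
    ["c", "😉", "🙏", "👍"],
    ["d", "😉", "✨", "💪"],
    ["e", "❤", "😉", "🙏"],
    ["f", "👅", "😱", "🤑"],
]
print(group_rows(rows))  # [0, 0, 0, 0, 0, 1]
```

With the dataframe from the question, the same list of rows would presumably come from `df.values.tolist()`, and the returned labels could be stored in a new column if a per-row group id is more convenient than the nested list.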