I have the following dataset:
0 1 2 3
0 a ❤ 💛 👍
1 b ❤ 👍 🙏
2 c 😉 🙏 👍
3 d 😉 ✨ 💪
4 e ❤ 😉 🙏
I would like to perform clustering to group the ROWS which have something in common.
By using networkx in the following code, this is the result:
import networkx as nx
import matplotlib.pyplot as plt
G=nx.from_pandas_edgelist(df, 0, 1)
nx.draw(G, with_labels=True)
plt.show()
output: groups obtained with networkx
How can I also consider columns 2 and 3? Can I do it without giving priority to any particular column (for example, I want column 2 to be as important as column 1)?
Similarly to this answer, you could treat each dataframe row as a path and look for the connected components. I've added a row that shares no values with any other row to better illustrate how this works:
print(df)
0 1 2 3
0 a ❤ 💛 👍
1 b ❤ 👍 🙏
2 c 😉 🙏 👍
3 d 😉 ✨ 💪
4 e ❤ 😉 🙏
5 f 👅 😱 🤑
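To make the example reproducible, the frame printed above could be rebuilt roughly as follows. This is only a sketch: it assumes the default integer column labels 0 to 3 shown in the output, and the variable name rows is just illustrative.

```python
import pandas as pd

# One inner list per dataframe row, copied from the printed frame above.
rows = [
    ["a", "❤", "💛", "👍"],
    ["b", "❤", "👍", "🙏"],
    ["c", "😉", "🙏", "👍"],
    ["d", "😉", "✨", "💪"],
    ["e", "❤", "😉", "🙏"],
    ["f", "👅", "😱", "🤑"],  # extra row sharing nothing with the others
]

# No column names are passed, so pandas labels the columns 0, 1, 2, 3.
df = pd.DataFrame(rows)
print(df)
```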
So iterate over the dataframe rows and add each one as a path with nx.add_path:
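# Each row becomes a list of node labels, e.g. ['a', '❤', '💛', '👍']
my_list = df.values.tolist()
G = nx.Graph()
for path in my_list:
    # Link consecutive values of the row so the whole row ends up connected
    nx.add_path(G, path)
# Rows that share any value end up in the same connected component
components = list(nx.connected_components(G))
print(components)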
[{'a', 'b', 'c', 'd', 'e', '✨', '❤', '👍', '💛', '💪', '😉', '🙏'},
{'f', '👅', '😱', '🤑'}]
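To see why no column gets priority, it may help to look at what nx.add_path does with just two rows. The sketch below uses the first two rows of the frame; the variable names edges and linked are only illustrative. Each row contributes edges between its consecutive values, and because both rows contain ❤ and 👍, 'a' and 'b' become reachable from each other regardless of which column the shared value sits in.

```python
import networkx as nx

G = nx.Graph()
# Row 0 adds the edges a-❤, ❤-💛 and 💛-👍.
nx.add_path(G, ["a", "❤", "💛", "👍"])
# Row 1 adds the edges b-❤, ❤-👍 and 👍-🙏.
nx.add_path(G, ["b", "❤", "👍", "🙏"])

edges = list(G.edges)
# The shared values ❤ and 👍 connect the two rows.
linked = nx.has_path(G, "a", "b")
print(edges)
print(linked)
```

Note that connectivity is transitive: two rows with no value in common still land in the same component if a third row links them.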
And now you can traverse the components and, for each one, collect every row that is a subset of it into a new sublist of a nested list:
groups = []
for component in components:
    # Collect the rows whose values all belong to this component
    group = []
    for path in my_list:
        if component.issuperset(path):
            group.append(path)
    groups.append(group)
In this case you'd have all rows except the last grouped together, and the last one in a group of its own.
print(groups)
[[['a', '❤', '💛', '👍'],
['b', '❤', '👍', '🙏'],
['c', '😉', '🙏', '👍'],
['d', '😉', '✨', '💪'],
['e', '❤', '😉', '🙏']],
[['f', '👅', '😱', '🤑']]]
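If a label per row is more convenient than a nested list, one possible variation is to write the component index back into the dataframe. This is a sketch, not part of the approach above: the column name group and the dictionary label_of are made up for illustration, and it assumes the identifiers in column 0 never coincide with values in the other columns.

```python
import networkx as nx
import pandas as pd

df = pd.DataFrame([
    ["a", "❤", "💛", "👍"],
    ["b", "❤", "👍", "🙏"],
    ["c", "😉", "🙏", "👍"],
    ["d", "😉", "✨", "💪"],
    ["e", "❤", "😉", "🙏"],
    ["f", "👅", "😱", "🤑"],
])

# Build the graph exactly as before: one path per row.
G = nx.Graph()
for row in df.values.tolist():
    nx.add_path(G, row)

# Map every node (cell value) to the index of the component it belongs to.
label_of = {}
for i, component in enumerate(nx.connected_components(G)):
    for node in component:
        label_of[node] = i

# All values of a row share one component, so looking up column 0 is enough.
df["group"] = df[0].map(label_of)
print(df)
```

Rows 'a' to 'e' receive the same label and 'f' receives a different one, matching the nested list above.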