machine-learning scikit-learn cluster-analysis

Clustering to achieve heterogeneous groups

I want to group 100 users based on a categorical variable (which can be low, medium, or high). The group size should be 3. I want maximal heterogeneity within each group, assuming that users are distributed equally across the categories. Can I use a clustering algorithm to group users by dissimilarity? Any suggestions?

Solution

I don't believe you need a clustering algorithm to group the data based upon a categorical variable.

Based on your question, I think this should work.

# Code
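from sklearn.model_selection import train_test_split

# `data` is a pandas DataFrame with a categorical column 'lab' (L, M, H).
# First split off one third, keeping the 'lab' proportions in both parts.
group1, group23 = train_test_split(data, test_size=2/3, stratify=data['lab'])
# Then halve the remainder, again stratified on 'lab'.
group2, group3 = train_test_split(group23, test_size=1/2, stratify=group23['lab'])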

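The stratify argument keeps the proportions of 'lab' the same in every split, so each group receives low, medium, and high users in roughly the same ratio as the full data set. With equally distributed categories, that gives each group an even mix of the three levels.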
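To check the effect of stratify, here is a minimal self-contained sketch. It assumes a toy DataFrame with three rows per label, and the random_state values are an addition for reproducibility only; they are not part of the snippet above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): three rows per label.
data = pd.DataFrame({
    "val1": range(1, 10),
    "val2": range(1, 10),
    "lab": ["L"] * 3 + ["M"] * 3 + ["H"] * 3,
})

# random_state is an assumption added here so the run is repeatable.
group1, group23 = train_test_split(
    data, test_size=2/3, stratify=data["lab"], random_state=0)
group2, group3 = train_test_split(
    group23, test_size=1/2, stratify=group23["lab"], random_state=0)

# Count labels per group; with balanced labels each group holds one of each.
for name, g in [("group1", group1), ("group2", group2), ("group3", group3)]:
    print(name, g["lab"].value_counts().to_dict())
```

Which rows land in which group depends on the seed, but the label counts per group should stay the same.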

# Sample output

print(data)
   val1  val2 lab
0     1     1   L
1     2     2   L
2     3     3   L
3     4     4   M
4     5     5   M
5     6     6   M
6     7     7   H
7     8     8   H
8     9     9   H

print(group1)
   val1  val2 lab
4     5     5   M
1     2     2   L
6     7     7   H

print(group2)
   val1  val2 lab
8     9     9   H
2     3     3   L
3     4     4   M

print(group3)
   val1  val2 lab
0     1     1   L
7     8     8   H
5     6     6   M
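Note that the split above yields three groups. If the goal is instead many groups of three users each (about 34 groups for 100 users), one possible alternative is to shuffle the users, sort them by label, and deal them out round-robin. This is a different technique from the train_test_split approach, offered only as a sketch: the helper name deal_into_groups, the column names, and the generated data are all hypothetical, and it only mixes labels well when the categories are roughly balanced.

```python
import numpy as np
import pandas as pd

def deal_into_groups(data, label_col, group_size, seed=0):
    """Hypothetical helper: add a 'group' column that spreads labels across groups."""
    n_groups = int(np.ceil(len(data) / group_size))
    # Shuffle first so the choice of users within each label is random.
    shuffled = data.sample(frac=1, random_state=seed)
    # Stable sort keeps the shuffled order inside each label.
    ordered = shuffled.sort_values(label_col, kind="mergesort")
    # Deal rows out like cards: consecutive rows go to different groups.
    return ordered.assign(group=np.arange(len(ordered)) % n_groups)

# Made-up data: 100 users with near-equal numbers of L, M and H.
rng = np.random.default_rng(0)
labels = rng.permutation((["L", "M", "H"] * 34)[:100])
users = pd.DataFrame({"user": range(100), "lab": labels})

grouped = deal_into_groups(users, "lab", group_size=3)
sizes = grouped.groupby("group").size()
distinct = grouped.groupby("group")["lab"].nunique()
print(sizes.value_counts().to_dict())   # how many groups have each size
print((distinct == sizes).all())        # True if no group repeats a label
```

Because 100 is not divisible by 3, some groups end up with two users instead of three.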

train_test_split() Documentation