machine-learning, scikit-learn, cluster-analysis

Clustering to achieve heterogeneous groups


I want to group 100 users based on a categorical variable that can be low, medium, or high. The group size should be 3. I want maximal heterogeneity within each group, assuming that users are distributed equally. Can I use a clustering algorithm to group users by dissimilarity? Any suggestions?


Solution

  • I don't believe you need a clustering algorithm to group the data based upon a categorical variable.

    Based on your question, I think this should work.

    # Code
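    from sklearn.model_selection import train_test_split

    # data is assumed to be a pandas DataFrame with the categorical column 'lab'.
    # First split off one third of the rows, stratified on 'lab'.
    group1, group23 = train_test_split(data, test_size=2/3., stratify=data['lab'])
    # Then split the remaining two thirds in half, again stratified on 'lab'.
    group2, group3 = train_test_split(group23, test_size=1/2., stratify=group23['lab'])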
    

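    The stratify argument keeps the proportion of each category (L, M, H) the same in every split, so each group gets as even a mix of the categories as its size allows. That even mix is what gives the maximum heterogeneity within each group.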

    # Sample output
    
    print(data)
       val1  val2 lab
    0     1     1   L
    1     2     2   L
    2     3     3   L
    3     4     4   M
    4     5     5   M
    5     6     6   M
    6     7     7   H
    7     8     8   H
    8     9     9   H
    
    print(group1)
       val1  val2 lab
    4     5     5   M
    1     2     2   L
    6     7     7   H
    
    print(group2)
       val1  val2 lab
    8     9     9   H
    2     3     3   L
    3     4     4   M
    
    print(group3)
       val1  val2 lab
    0     1     1   L
    7     8     8   H
    5     6     6   M
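
The sample output can be checked with a short, self-contained sketch. It rebuilds the nine-row frame shown above and counts the categories in each group. The random_state value is an added assumption to make the split repeatable; the exact rows in each group may differ from the output above, but the per-category counts should not.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Sample frame matching the printed data above.
data = pd.DataFrame({
    "val1": range(1, 10),
    "val2": range(1, 10),
    "lab": list("LLLMMMHHH"),
})

# random_state is an assumption added here for repeatability.
group1, group23 = train_test_split(
    data, test_size=2/3, stratify=data["lab"], random_state=0
)
group2, group3 = train_test_split(
    group23, test_size=1/2, stratify=group23["lab"], random_state=0
)

# Count how many rows of each category landed in each group.
groups = [("group1", group1), ("group2", group2), ("group3", group3)]
counts = pd.DataFrame(
    {name: grp["lab"].value_counts() for name, grp in groups}
).fillna(0).astype(int)
print(counts)
```

With three rows per category and three groups, every group should hold exactly one L, one M, and one H.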
    

    train_test_split() Documentation
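
If the goal is many small groups (for example, 100 users in groups of 3) rather than exactly three groups, repeated calls to train_test_split get awkward. One possible alternative, which is not part of the answer above, is to shuffle the rows, sort them by category, and deal them out round-robin. The sketch below assumes pandas and NumPy; the helper name deal_into_groups and the column names user and level are hypothetical.

```python
import numpy as np
import pandas as pd


def deal_into_groups(df, label_col, group_size, seed=0):
    """Hypothetical helper: assign rows to groups of about `group_size`,
    spreading each category across groups as evenly as possible."""
    n_groups = int(np.ceil(len(df) / group_size))
    # Shuffle first so that membership within a category is random.
    shuffled = df.sample(frac=1, random_state=seed)
    # Stable sort keeps the shuffled order inside each category.
    ordered = shuffled.sort_values(label_col, kind="mergesort").copy()
    # Deal rows out like cards: consecutive rows go to different groups.
    ordered["group"] = np.arange(len(ordered)) % n_groups
    return ordered


# Usage with made-up data: 100 users with roughly equal category counts.
users = pd.DataFrame({
    "user": range(100),
    "level": np.resize(["low", "medium", "high"], 100),
})
grouped = deal_into_groups(users, "level", group_size=3)
print(grouped.groupby("group")["level"].apply(sorted).head())
```

Because each category occupies a contiguous run of the sorted rows, dealing with a stride of n_groups gives every group either the floor or the ceiling of (category count / n_groups) members from that category. When the number of rows is not a multiple of the group size, some groups end up one member short.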