Search code examples
python-3.xcluster-analysispattern-recognition

Clustering of a list of strings of letters according to 2 (and ideally generalized to n) arbitrary grouping rules?


I want to sort a set of strings (of letters) of variable length in n groups, based on inclusion of any/all/none letters of n given sets.

For instance, here I am trying to sort all combinations of the letters 'A,B,P,Q,X' in 2 groups, with the following rules: group1 must include all/any of 'A,P' (but not 'B,Q'), group2 must include all/any of 'B,Q' (but not 'A,P'). My final goal is to build a list that has the groups as segregated as possible (e.g. beginning and end), with strings containing no members of any group in the middle, followed by members of both groups and hybrids between the middle and the extremes. Ideally the order would be: all-1/none-2,some-1/none-2,all-1/some-2,none-1-2/some-1-2,all-2/some-1,some-2/none-1,all-2/none-1.

labels_powerset = ['A','B','P','Q','X',
    'AB','AP','AQ','AX','BP','BQ','BX','PQ','PX','QX',
    'ABP','ABQ','ABX','APQ','APX','AQX','BPQ','BPX','BQX','PQX',
    'ABPQ','ABPX','ABQX','APQX','BPQX','ABPQX']

labels_for_order = []

for length in range(1,len(max(labels_powerset,key=len))+1):
    order = [label for label in labels_powerset if len(label)==length]
    labels_for_order.append(order)

group1 = ['A','P']
group2 = ['B','Q']

all1 = [y for y in[[label for label in order if all(x in label for x in group1) and not any(y in label for y in group2)]
        for order in labels_for_order]if y]

any1 = [y for y in[[label for label in order if any(x in label for x in group1) and not all(x in label for x in group1) and not any(y in label for y in group2)]
        for order in labels_for_order]if y]

all2 = [y for y in[[label for label in order if all(x in label for x in group2) and not any(y in label for y in group1)]
        for order in labels_for_order]if y]

any2 = [y for y in[[label for label in order if any(x in label for x in group2) and not all(x in label for x in group2) and not any(y in label for y in group1)]
        for order in labels_for_order]if y]

none = [y for y in[[label for label in order if not any(x in label for x in group1) and not any(y in label for y in group2)]
        for order in labels_for_order]if y]

both = [y for y in[[label for label in order if any(x in label for x in group1) and any(y in label for y in group2)]
        for order in labels_for_order]if y]

both1 = [both[x] for x in range(0,int(len(both)/2))]

both2 = [both[x] for x in range(int(len(both)/2),len(both))]

sorted_labels = flatten(any1+all1+both1+none+both2+all2+any2)

The objective is to have a list as symmetric as possible in terms of membership and length of the strings.

I am pretty new at coding and have read something on k-means but can't figure out how to apply it to strings of letters.

How do I do it more efficiently, and in a way that is generalizable to n groups/rules?


Solution

  • K-means is for multivariate continuous data, and clustering does not attempt to make balanced groups.

    What you should consider is to use sorting.

    Define a score function. For example, give +1 for each "good" letter, -1 for each "bad" letter, and a bonus of +-100 if it is pure.

    Then sort the words based on this score.