Tags: python, cluster-analysis

Best way to cluster strings using Python


I have 3,200,000 strings (all possible 5-character strings over the alphabet 'ACDEFGHIKLMNPQRSTVWY'). A small example of the data:

['AAAAA', 'AAAAC', 'AAAAD', 'AAAAE', 'AAAAF', 'AAAAG', 'AAAAH',
 'AAAAI', 'AAAAK', 'AAAAL', 'AAAAM', 'AAAAN', 'AAAAP', 'AAAAQ',
 'AAAAR', 'AAAAS', 'AAAAT', 'AAAAV', 'AAAAW', 'AAAAY', 'AAACA',
 'AAACC', 'AAACD', 'AAACE', 'AAACF', 'AAACG', 'AAACH', 'AAACI',
 'AAACK', 'AAACL', 'AAACM', 'AAACN', 'AAACP', 'AAACQ', 'AAACR',
 'AAACS', 'AAACT', 'AAACV', 'AAACW', 'AAACY'...]

I use this code to construct the big list of small strings.

import itertools

def get_all_possible_kmers(alphabet, k):
    # Generate every string of length k over the given alphabet
    return [''.join(chars) for chars in itertools.product(alphabet, repeat=k)]
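
For example, with the 20-letter alphabet and k = 5, this gives the 3,200,000 strings mentioned above:

alphabet = 'ACDEFGHIKLMNPQRSTVWY'
kmers = get_all_possible_kmers(alphabet, 5)
print(len(kmers))    # 3200000 == 20 ** 5
print(kmers[:3])     # ['AAAAA', 'AAAAC', 'AAAAD']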

My intention is to cluster all strings into groups of shared/similar character compositions, i.e. groups of strings that are permutations of one another. There would be different groups such as a1b1c1d1e1, a1b1c1d2, a1b2c2, ..., a5. Each cluster would have to include compositions such as a1b4, b1a4, a1c4, etc. So each cluster would include all strings that are permutations of a given string composition, e.g. abbbb, babbb, etc. Is there any Python tool I can use to do this kind of clustering? I was thinking of using some kind of distance metric, like Hamming distance or Levenshtein distance.

Do you have any suggestions on a good direction to start?

Thank you for your time and knowledge.

Paulo


Solution

  • You can use collections.Counter to generate a cluster key and collect each string into a set in a dictionary keyed by it.

    For example:

    from collections import Counter, defaultdict

    clusters = defaultdict(set)
    for item in get_all_possible_kmers(alphabet, k):
        # Sort the (letter, count) pairs so every permutation of the same
        # composition maps to the same key. (Plain str(Counter(item)) orders
        # items by count, so ties can produce different strings for
        # permutations of the same composition.)
        key = tuple(sorted(Counter(item).items()))
        clusters[key].add(item)


    You can also format the Counter-based key to look like the notation you described (a1b4, ...).
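
    For example, a small helper along these lines (the name composition_key is just illustrative) turns a string into that compact notation:

    from collections import Counter

    def composition_key(s):
        # Sort by letter so every permutation of s yields the same key,
        # e.g. 'ABBBB' and 'BABBB' both become 'A1B4'.
        return ''.join(f'{letter}{count}'
                       for letter, count in sorted(Counter(s).items()))

    print(composition_key('ABBBB'))   # A1B4
    print(composition_key('BABBB'))   # A1B4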

    Also, you can simplify the process by updating the clusters as each k-mer is generated, instead of building the full list first; see the sketch below.
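
    A minimal sketch of that idea, assuming alphabet and k are defined as in the question:

    import itertools
    from collections import Counter, defaultdict

    clusters = defaultdict(set)
    for chars in itertools.product(alphabet, repeat=k):
        kmer = ''.join(chars)
        # Group by the canonical composition key without materializing
        # the full 3.2M-element list of k-mers first.
        clusters[tuple(sorted(Counter(kmer).items()))].add(kmer)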