python algorithm cluster-analysis bioinformatics k-means

Can I use K-means algorithm on a string?

I am working on a python project where I study RNA structure evolution (represented as a string for example: "(((...)))" where the parenthesis represent basepairs). The point being is that I have an ideal structure and a population that evolves towards the ideal structure. I have implemented everything however I would like to add a feature where I can get the "number of buckets" ie the k most representative structures in the population at each generation.

I was thinking of using the k-means algorithm but I am not sure how to use it with strings. I found scipy.cluster.vq but I don't know how to use it in my case.

thanks!

Solution

K-means doesn't really care about the type of the data involved. All you need to do a K-means is some way to measure a "distance" from one item to another. It'll do its thing based on the distances, regardless of how that happens to be computed from the underlying data.

That said, I haven't used scipy.cluster.vq, so I'm not sure exactly how you tell it the relationship between items, or how to compute a distance from item A to item B.