cluster-analysis k-means rapidminer

Clustering text entities with RapidMiner


I have cloud tags A, B, C. Each cloud tag consists of entities (words) e, f, g, ...

I want to find good words that separate the cloud tags into (mostly) independent clusters. For example:

Word e occurs in cloud tags A and B but not in C, so e is a good separator that yields 2 clusters.

Now there are about 100,000 cloud tags and 1,000,000 words, and I want to do the same to get K clusters. A cloud tag may belong to two clusters; that is not that important.

I know k-means, but I don't know how to transform the data into numerical, multi-dimensional data. As far as I know, k-means needs numerical points to create clusters.

I would also like to use RapidMiner as the software, but any algorithm or software would be useful as a starting point.

Thanks in advance.


Solution

  • What you describe is not clustering, but feature (word) selection for "cloud tag" classification.

    Have a look at decision trees, and at the metrics used there (such as information gain or Gini impurity) to identify good features for splitting.
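
To make the decision-tree idea concrete, here is a minimal sketch that scores each word by information gain, one of the split metrics decision trees use. It rests on assumptions that are not in the question: every cloud tag is stored as a set of words, and every cloud tag already has a target group label (the hypothetical `tag_labels`). The function names `entropy`, `information_gain` and `rank_words` are made up for illustration and are not part of RapidMiner.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of group labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(word, tag_words, tag_labels):
    """Entropy reduction obtained by splitting the cloud tags on `word`."""
    with_word = [tag_labels[t] for t, words in tag_words.items() if word in words]
    without_word = [tag_labels[t] for t, words in tag_words.items() if word not in words]
    total = len(tag_words)
    # Weighted entropy of the two sides of the split.
    remainder = (
        len(with_word) / total * entropy(with_word)
        + len(without_word) / total * entropy(without_word)
    )
    return entropy([tag_labels[t] for t in tag_words]) - remainder


def rank_words(tag_words, tag_labels):
    """Return (word, gain) pairs, best separator first."""
    vocabulary = set().union(*tag_words.values())
    scored = [(w, information_gain(w, tag_words, tag_labels)) for w in vocabulary]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy data modelled on the question: word e is in A and B but not in C.
tag_words = {"A": {"e", "f"}, "B": {"e", "g"}, "C": {"f", "g"}}
# Hypothetical group labels (an assumption, not given in the question).
tag_labels = {"A": "x", "B": "x", "C": "y"}

ranking = rank_words(tag_words, tag_labels)
```

On this toy data, `e` ranks first because splitting on it separates {A, B} from {C} exactly. With 100,000 cloud tags and 1,000,000 words, the same counts would have to come from a sparse representation rather than Python sets; this sketch only illustrates the metric.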