cluster-analysis, weka, k-means

Weka Simple K means handling nominal attributes


I am trying to understand how simple K-means in Weka handles nominal attributes and why it is not efficient in handling such attributes.

I read that it calculates modes for such attributes. I want to know how the similarity is calculated.

Let's take an example: consider a dataset with 3 numeric attributes and one nominal attribute. The nominal attribute has 3 values: A, B and C.

Instance1 has value A, Instance2 has value B and Instance3 has value A. In this case, Instance1 may be more similar to Instance3 (depending on the numeric attributes, of course). How will Simple K-means work in this case?

Follow-up: What if the nominal attribute has more (say, 10) possible values?


Solution

  • You can convert each nominal attribute to binary features, one per value, e.g. has_A, has_B, has_C. If you then scale them, i1 and i3 will be closer, because the mean of has_A (2 of the 3 instances in your example) will be above 0.5, and i2 will stand out more.

    If the attribute has more values, just add a binary feature for every possible value. Basically, you pivot each nominal attribute.
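
To make the idea concrete, here is a minimal sketch in plain Python (not Weka code) of the pivot described above, applied only to the nominal attribute from the question. The helper names `one_hot` and `euclidean` are made up for illustration, and the numeric attributes are left out, so the distances cover the nominal part only.

```python
import math

def one_hot(value, categories):
    """Pivot one nominal value into binary features, one per category (hypothetical helper)."""
    return [1.0 if value == c else 0.0 for c in categories]

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors (hypothetical helper)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# The three instances from the question; only the nominal attribute is modelled.
categories = ["A", "B", "C"]
instances = {"i1": "A", "i2": "B", "i3": "A"}

# Each instance becomes [has_A, has_B, has_C].
encoded = {name: one_hot(v, categories) for name, v in instances.items()}

d13 = euclidean(encoded["i1"], encoded["i3"])  # same value: identical vectors
d12 = euclidean(encoded["i1"], encoded["i2"])  # different values: two features differ

# Mean of the has_A feature over the three instances (2 of 3 have A).
mean_has_a = sum(vec[0] for vec in encoded.values()) / len(encoded)

# Follow-up case: a nominal attribute with 10 values just yields 10 binary features.
ten_categories = list("ABCDEFGHIJ")
encoded_d = one_hot("D", ten_categories)

print(encoded)
print(d13, d12, mean_has_a, len(encoded_d))
```

With this encoding, i1 and i3 coincide on the nominal part while i2 sits at the same fixed distance from both, which matches the intuition in the answer. Within Weka, the unsupervised NominalToBinary filter performs this kind of conversion; whether to scale the resulting features against the numeric ones is a separate choice that depends on the data.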