I am trying to understand how simple K-means in Weka handles nominal attributes and why it is not efficient in handling such attributes.
I read that it calculates modes for such attributes. I want to know how the similarity is calculated.
Let's take an example: consider a dataset with 3 numeric attributes and a nominal attribute. The nominal attribute has 3 values: A, B and C.
Instance1 has value A, Instance2 has value B and Instance3 has value A. In this case, Instance1 may be more similar to Instance3 (depending on the other numeric attributes, of course). How will Simple K-means work in this case?
Follow-up: What if the nominal attribute has more (e.g. 10) possible values?
You can convert each nominal attribute to binary features, one per possible value, e.g. has_A, has_B, has_C.
If you then scale them, i1 and i3 will be closer together, because the mean of has_A is above 0.5 in your example (two of the three instances have A), and i2 will stand out more.
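To make this concrete, here is a minimal sketch in plain Python (not Weka code) that encodes the nominal attribute from the question and compares the instances on those binary features only. The names `VALUES`, `one_hot` and `euclidean` are made up for illustration, and the numeric attributes are left out on purpose.

```python
import math

# Possible values of the nominal attribute (from the question's example).
VALUES = ["A", "B", "C"]

def one_hot(value, values=VALUES):
    """Return one 0/1 indicator per possible value: has_A, has_B, has_C."""
    return [1.0 if value == v else 0.0 for v in values]

def euclidean(x, y):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Instance1 = A, Instance2 = B, Instance3 = A
i1, i2, i3 = one_hot("A"), one_hot("B"), one_hot("A")

d13 = euclidean(i1, i3)   # same value, so the distance is 0.0
d12 = euclidean(i1, i2)   # different values, so the distance is sqrt(2)

# Mean of the has_A column over the three instances: 2/3, i.e. above 0.5.
mean_has_a = (i1[0] + i2[0] + i3[0]) / 3

print(d13, d12, mean_has_a)
```

On these features alone, i1 and i3 coincide and i2 sits at distance sqrt(2) from both. In a real run the numeric attributes contribute to the distance too, so how much the nominal attribute matters depends on how everything is scaled.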
If the attribute has more values, you just add one binary feature for every possible value. Basically, you pivot each nominal attribute.
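The pivot itself can be sketched generically, so that it works the same way for 3 or 10 values. This is again plain Python rather than Weka, and `pivot_nominal`, the `label` column and the sample rows are hypothetical. Note that this sketch only creates columns for values that actually occur in the data.

```python
def pivot_nominal(rows, column):
    """Replace one nominal column with a has_<value> 0/1 column per observed value."""
    values = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other attribute unchanged.
        new_row = {k: v for k, v in row.items() if k != column}
        for v in values:
            new_row[f"has_{v}"] = 1 if row[column] == v else 0
        encoded.append(new_row)
    return encoded

# Hypothetical data: one numeric attribute "x" and a nominal "label" with 10 values.
labels = list("ABCDEFGHIJ")
rows = [{"x": float(i), "label": lab} for i, lab in enumerate(labels)]

encoded = pivot_nominal(rows, "label")
print(len(encoded[0]))  # 1 numeric column + 10 binary columns
```

Each instance ends up with exactly one binary column set to 1, so two instances with different nominal values always differ in exactly two of the new columns, no matter how many values the attribute has.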