I have categorical attributes that contain string values. Three of them contain the day name (Mon to Sun), the month name, and the time interval (morning, afternoon, evening). Two others, as I mentioned before, hold district and street names. These are followed by gender, role, comments (a predefined field with fixed values such as good, bad, strongly agree, etc.), surname, and first name. My intention is to cluster the records and visualize the result. I applied k-means clustering using WEKA, but it did not work. Now I wish to apply hierarchical clustering instead. I found this code:
import numpy as np
import scipy.cluster.hierarchy as sch
X = np.random.randn(100, 2) # 100 2-dimensional observations
d = sch.distance.pdist(X) # vector of (100 choose 2) pairwise distances
L = sch.linkage(d, method='complete')
ind = sch.fcluster(L, 0.5*d.max(), 'distance')
However, X in the code above is numeric; I have categorical data.
Is there some way that I can use a NumPy array of categorical data to find the distance?
In other words can I use categorical data of string values to find the distance?
I would then use that distance in sch.linkage(d, method='complete')
I think we've identified the problem, then: you leave the X values as they are, string data. You can pass those to pdist, but you also have to supply a 2-arity function (2 inputs, numeric output) for the distance metric.
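The simplest one would be that equal classifications have 0 distance; everything else is 1. pdist hands the function two rows of X at a time, so the element-wise comparison has to be reduced to a single number. You can do this with
d = sch.distance.pdist(X, lambda u, v: float((u != v).any()))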
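As a rough end-to-end sketch, assuming the records sit in a 2-D NumPy array of strings with one row per observation: the toy rows, the column meanings, and the helper name attribute_mismatch below are made up for illustration. Instead of an all-or-nothing 0/1, this variant returns the fraction of attributes that differ, which is one possible choice, not the only one.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

# Toy data (hypothetical): each row is [day, time interval, district].
X = np.array([
    ["mon", "morning", "north"],
    ["mon", "morning", "north"],
    ["mon", "evening", "north"],
    ["sat", "evening", "south"],
    ["sun", "evening", "south"],
])

def attribute_mismatch(u, v):
    """Fraction of attributes on which two rows disagree (0.0 to 1.0)."""
    return float(np.mean(u != v))

# Condensed vector of pairwise distances, computed with the custom metric.
d = pdist(X, attribute_mismatch)

# Same steps as in the question: complete linkage, then cut the tree.
L = sch.linkage(d, method="complete")
labels = sch.fcluster(L, 0.5, "distance")
print(labels)
```

With these five rows, the first three and the last two end up in separate clusters at the 0.5 cut. For visualization, sch.dendrogram(L) draws the tree if matplotlib is installed.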
If you have other class discrimination in mind, just code logic to return the desired distance, wrap it in a function, and then pass the function name to pdist. We can't help with that, because you've told us nothing about your classes or the model semantics.
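For instance, a hand-written metric might treat day names as points on a weekly cycle while keeping districts as plain equal-or-not categories. This is only a sketch: the column layout, the equal weighting of the two parts, and the names day_gap and record_distance are assumptions, not something taken from your data.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
DAY_INDEX = {name: i for i, name in enumerate(DAYS)}

def day_gap(a, b):
    """Circular distance between two day names, scaled to [0, 1]."""
    steps = abs(DAY_INDEX[a] - DAY_INDEX[b])
    steps = min(steps, 7 - steps)  # wrap around: sun and mon are 1 step apart
    return steps / 3.0             # 3 is the largest possible circular gap

def record_distance(u, v):
    """Hypothetical mixed metric: column 0 is a day name, column 1 a district."""
    day_part = day_gap(u[0], v[0])
    district_part = 0.0 if u[1] == v[1] else 1.0
    return (day_part + district_part) / 2.0  # equal weights, an arbitrary choice

X = np.array([
    ["mon", "north"],
    ["sun", "north"],
    ["thu", "south"],
])

d = pdist(X, record_distance)  # condensed distances, ready for sch.linkage
D = squareform(d)              # full square matrix, easier to inspect
print(D)
```

The function is passed by name, exactly as the lambda was; only the logic inside changes.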
Does that get you moving?