python hierarchical-clustering categorical-data

Hierarchical clustering for categorical data in Python

I have categorical attributes that contain string values. Three of them contain the day name (Mon to Sun), the month name, and the time interval (morning, afternoon, evening); two others hold district and street names. The remaining ones are gender, role, comments (a predefined field with fixed values such as good, bad, strong agree, etc.), surname, and first name. My intention is to cluster the data and visualize the result. I applied k-means clustering in WEKA, but it did not work. Now I would like to apply hierarchical clustering instead. I found this code:

import numpy as np
import scipy.cluster.hierarchy as sch
X = np.random.randn(100, 2)     # 100 2-dimensional observations
d = sch.distance.pdist(X)   # vector of (100 choose 2) pairwise distances
L = sch.linkage(d, method='complete')
ind = sch.fcluster(L, 0.5*d.max(), 'distance')

However, X in the code above is numeric, while my data is categorical. Is there some way to use an array of categorical string values to compute the distances? I would then pass that distance vector to sch.linkage(d, method='complete').

Solution

I think we've identified the problem, then: you leave the X values as they are, string data. You can pass those to pdist, but you also have to supply a 2-arity function (2 inputs, numeric output) for the distance metric.

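The simplest one would be that identical rows have distance 0 and everything else has distance 1. Assuming X is a 2-D NumPy array, u and v are whole rows and u != v is an element-wise array, so reduce it to a single number first:

d = sch.distance.pdist(X, lambda u, v: float((u != v).any()))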
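
As a runnable sketch of the same idea end to end: the rows below are made up for illustration, and two choices here are assumptions rather than anything stated in the question. First, each string is replaced by an integer code so that pdist receives numeric input; the codes are arbitrary labels and are only compared for equality. Second, the built-in 'hamming' metric is used, which gives the fraction of attributes on which two rows disagree rather than a strict 0/1.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

# Hypothetical rows: (day, time interval, gender, comment)
rows = np.array([
    ["mon", "morning", "F", "good"],
    ["mon", "morning", "F", "good"],
    ["mon", "evening", "F", "good"],
    ["sun", "evening", "M", "bad"],
    ["sun", "evening", "M", "bad"],
])

# Replace each string by an integer code, column by column.
# The codes carry no order; only equality between them is meaningful.
codes = np.column_stack(
    [np.unique(col, return_inverse=True)[1].ravel() for col in rows.T]
)

# 'hamming' = fraction of attributes on which two rows disagree.
d = pdist(codes, metric="hamming")      # condensed distance vector
L = sch.linkage(d, method="complete")
labels = sch.fcluster(L, 0.5 * d.max(), "distance")
```

With these sample rows, the first three end up in one cluster and the last two in another. For the visualization part, sch.dendrogram(L) can draw the merge tree if matplotlib is installed.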

If you have other class discrimination in mind, just code logic to return the desired distance, wrap it in a function, and then pass the function name to pdist. We can't help with that, because you've told us nothing about your classes or the model semantics.
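
Purely to illustrate the shape such a function could take, here is a sketch that treats day and gender as unordered labels but treats the comment field as an ordered scale. The ordering in COMMENT_RANK, the sample rows, and the equal weighting of the three attributes are all hypothetical; the real semantics would have to come from your data.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Hypothetical ordering of the comment values; the real scale is not given.
COMMENT_RANK = {"bad": 0, "good": 1, "strong agree": 2}

# Hypothetical rows: (day, gender, comment)
rows = [
    ("mon", "F", "bad"),
    ("mon", "F", "good"),
    ("sun", "M", "strong agree"),
]

# Numeric encoding: nominal columns get arbitrary codes,
# the comment column gets its rank on the assumed scale.
day_code = {name: i for i, name in enumerate(sorted({r[0] for r in rows}))}
gender_code = {name: i for i, name in enumerate(sorted({r[1] for r in rows}))}
X = np.array(
    [[day_code[r[0]], gender_code[r[1]], COMMENT_RANK[r[2]]] for r in rows],
    dtype=float,
)

def mixed_distance(u, v):
    """Nominal columns add 0 or 1 each; the comment column adds its
    rank gap scaled to [0, 1]. The total is averaged over 3 attributes."""
    nominal = np.sum(u[:2] != v[:2])
    ordinal = abs(u[2] - v[2]) / (len(COMMENT_RANK) - 1)
    return float(nominal + ordinal) / 3.0

d = pdist(X, metric=mixed_distance)   # pairs in order (0,1), (0,2), (1,2)
```

The resulting d can be passed to sch.linkage exactly as in the question's code.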

Does that get you moving?