machine-learning data-mining cluster-analysis

How to cluster data with discrete binary attributes?

My data has ten million binary attributes, but only some of them are informative; most of them are zero.

The format is as follows:

data  attribute1 attribute2 attribute3 attribute4   .........
A          0          1           0         1       .........
B          1          0           1         0       .........
C          1          1           0         1       .........
D          1          1           0         0       .........

What is a smart way to cluster this? I know k-means clustering, but I don't think it is suitable in this case, because binary values make distances less meaningful and it will suffer from the curse of dimensionality. Even if I cluster on only the few informative attributes, that is still too many attributes.

I think a decision tree would be a nice fit for this data, but it is a classification algorithm, not a clustering one!

What can I do?

Solution

Have you considered frequent itemset mining instead?

K-means is definitely a bad idea, but hierarchical clustering may work when used with an appropriate distance function, for example Jaccard, Hamming, or Dice.
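To make the distance functions concrete, here is a minimal pure-Python sketch. It assumes each row is stored sparsely as the set of attribute indices that are 1, which seems natural when most values are zero. The helper names (`jaccard_distance`, `dice_distance`, `hamming_distance`, `average_linkage`) are made up for illustration, and the naive average-linkage loop is only meant for tiny inputs like the four example rows, not for the full data set.

```python
from itertools import combinations

# Sparse representation: each row is the set of attribute indices equal to 1.
rows = {
    "A": {2, 4},
    "B": {1, 3},
    "C": {1, 2, 4},
    "D": {1, 2},
}

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; attributes that are 0 in both rows are ignored."""
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0

def dice_distance(a, b):
    """1 - 2|a & b| / (|a| + |b|)."""
    total = len(a) + len(b)
    return 1.0 - 2.0 * len(a & b) / total if total else 0.0

def hamming_distance(a, b):
    """Number of attributes on which the two rows differ."""
    return len(a ^ b)

def average_linkage(rows, distance, n_clusters):
    """Naive agglomerative clustering with average linkage (small inputs only)."""
    clusters = [[name] for name in rows]

    def avg(pair):
        # Mean pairwise distance between the members of two clusters.
        c1, c2 = pair
        total = sum(distance(rows[x], rows[y]) for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the two clusters that are closest on average.
        c1, c2 = min(combinations(clusters, 2), key=avg)
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(c1 + c2)
    return sorted(sorted(c) for c in clusters)

result = average_linkage(rows, jaccard_distance, n_clusters=2)
print(result)  # [['A', 'C', 'D'], ['B']]
```

Jaccard and Dice ignore attributes that are zero in both rows, which is usually what you want when zeros dominate. Hamming counts every mismatch equally. For anything beyond a toy example you would likely want a library implementation, and keep in mind that a full pairwise distance matrix grows quadratically with the number of rows.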

Anyway, what is a cluster? The choice of algorithm needs to fit the kind of cluster you want to find. On binary data, centroid-based methods such as k-means make little sense, because the centroid (a mean of 0/1 values) is generally not a binary vector itself and is hard to interpret.
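A quick illustration of the centroid problem, using rows A, C and D from the question as a hypothetical cluster:

```python
# Rows A, C and D from the question, treated as one hypothetical cluster.
cluster = [
    [0, 1, 0, 1],  # A
    [1, 1, 0, 1],  # C
    [1, 1, 0, 0],  # D
]

# The k-means centroid is the per-attribute mean.
centroid = [sum(column) / len(cluster) for column in zip(*cluster)]
print(centroid)

# The mean contains fractions, so it is not a valid binary row.
is_binary = all(value in (0.0, 1.0) for value in centroid)
print(is_binary)  # False
```

The first and last attributes come out as 2/3, a value no actual row can take, so the "cluster center" does not correspond to any possible data point.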

If the data is "shopping cart" type information, consider using frequent itemset mining, as it allows discovering overlapping subsets.
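As a rough sketch of what frequent itemset mining does, here is a small Apriori-style search in pure Python. It treats each row as a "transaction" containing the attributes that are 1. The function name `frequent_itemsets` and the `min_support` threshold of 2 are assumptions for illustration; on real data you would use an optimized implementation rather than this brute-force counting.

```python
from collections import Counter
from itertools import combinations

# Each row becomes the set of attributes that are 1.
transactions = [
    {"attribute2", "attribute4"},                # A
    {"attribute1", "attribute3"},                # B
    {"attribute1", "attribute2", "attribute4"},  # C
    {"attribute1", "attribute2"},                # D
]

def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support rows."""
    # Level 1: count single attributes.
    counts = Counter(item for t in transactions for item in t)
    current = {frozenset([i]): n for i, n in counts.items() if n >= min_support}
    result = dict(current)
    k = 2
    while current:
        # Build k-item candidates only from attributes that are still frequent,
        # and keep a candidate only if all its (k-1)-subsets are frequent.
        items = sorted(set().union(*current))
        candidates = [
            frozenset(c)
            for c in combinations(items, k)
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        ]
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_support}
        result.update(current)
        k += 1
    return result

itemsets = frequent_itemsets(transactions, min_support=2)
for itemset, count in sorted(itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

On the four example rows this finds the pairs {attribute1, attribute2} and {attribute2, attribute4}, each in two rows. Note that attribute2 and row C belong to both patterns, which is the kind of overlap a partitioning clustering cannot express.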