My data has ten million binary attributes, but only some of them are informative; most of the values are zeros.
The format looks like this:
data attribute1 attribute2 attribute3 attribute4 .........
A 0 1 0 1 .........
B 1 0 1 0 .........
C 1 1 0 1 .........
D 1 1 0 0 .........
What is a smart way to cluster this? I know k-means clustering, but I don't think it's suitable here, because binary values make distances less meaningful and the method will suffer from the curse of dimensionality. Even if I cluster on only the few informative attributes, that is still too many attributes.
I think a decision tree would suit this data, but that's a classification algorithm, not a clustering one!
What can I do?
Have you considered frequent itemset mining instead?
K-means is definitely a bad idea, but hierarchical clustering may work if you use an appropriate distance function such as Jaccard, Hamming, or Dice.
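To illustrate the hierarchical approach, here is a minimal sketch in plain Python, using the four sample rows from the question. It assumes each row is stored as the set of attribute indices equal to 1 (a sparse form that suits mostly-zero data), and the helper names `jaccard_distance` and `average_linkage` are made up for this example. The naive loop is cubic in the number of rows, so treat it as a demonstration of the idea rather than something to run on the full data set.

```python
from itertools import combinations

# Sparse representation: each row is the set of attribute indices that are 1.
rows = {
    "A": {2, 4},
    "B": {1, 3},
    "C": {1, 2, 4},
    "D": {1, 2},
}

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two all-zero rows are treated as identical."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union

def average_linkage(rows, n_clusters):
    """Naive agglomerative clustering with average linkage (illustration only)."""
    # Precompute all pairwise distances between individual rows.
    dist = {frozenset(pair): jaccard_distance(rows[pair[0]], rows[pair[1]])
            for pair in combinations(rows, 2)}

    def linkage(pair):
        # Average distance over all cross-cluster pairs of rows.
        c1, c2 = pair
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    # Start with every row in its own cluster, then merge the closest pair.
    clusters = [frozenset([name]) for name in rows]
    while len(clusters) > n_clusters:
        c1, c2 = min(combinations(clusters, 2), key=linkage)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters

clusters = average_linkage(rows, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

If SciPy is available, `scipy.spatial.distance.pdist` with `metric="jaccard"` followed by `scipy.cluster.hierarchy.linkage` should do the same job on a boolean array, although any method that needs all pairwise distances grows quadratically with the number of rows.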
Anyway, what is a cluster? The choice of algorithm needs to fit the kind of cluster you want to find. On binary data, centroid-based methods such as k-means don't make much sense, because the centroid is a vector of fractions rather than a binary vector and so is hard to interpret.
If the data is "shopping cart" type information, consider using frequent itemset mining, as it allows you to discover overlapping subsets.
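As a rough illustration of frequent itemset mining, the sketch below runs a level-wise (Apriori-style) search over the four sample rows from the question, again treating each row as the set of attributes equal to 1. The function name `frequent_itemsets` and the `min_support` threshold of 2 are assumptions chosen for the example; a dedicated library would be the better choice for data of any real size.

```python
from itertools import combinations

# Each transaction is the set of attribute indices that are 1 in that row.
transactions = [
    {2, 4},     # A
    {1, 3},     # B
    {1, 2, 4},  # C
    {1, 2},     # D
]

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for itemsets in >= min_support rows."""
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    def keep_frequent(candidates):
        counted = {c: support(c) for c in candidates}
        return {c: n for c, n in counted.items() if n >= min_support}

    # Level 1: single attributes.
    items = sorted(set().union(*transactions))
    level = keep_frequent(frozenset([i]) for i in items)

    result = {}
    k = 1
    while level:
        result.update(level)
        k += 1
        # Join frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = keep_frequent(candidates)
    return result

found = frequent_itemsets(transactions, min_support=2)
for itemset, count in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Note that attribute 2 appears in both of the frequent pairs found here, which is the kind of overlap that a partitioning method such as k-means cannot express.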