I have been given the task of clustering our customer base based on the products they bought together. My data contains 500,000 rows, one per customer, and 8,000 variables (product IDs). Each variable is a one-hot encoded indicator of whether a customer bought that product or not.
I have tried to reduce the dimensionality of my data with MCA (multiple correspondence analysis) and then used k-means and DBSCAN for cluster analysis, but my results were not satisfying.
What are some suitable algorithms for cluster analysis of large, high-dimensional datasets, and what are their Python implementations?
Instead of clustering, what you should likely be using is frequent pattern mining.
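Here is a minimal sketch of frequent itemset mining using mlxtend's FP-growth (assuming mlxtend is installed; the synthetic `df` below, including the planted co-purchase, is just a stand-in for your real one-hot matrix):

```python
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth

# Stand-in for the real data: a boolean customer x product matrix.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 20)) < 0.1,
                  columns=[f"product_{i}" for i in range(20)])
df["product_1"] = df["product_0"]  # plant a co-purchase pattern for the demo

# FP-growth scales much better than Apriori on wide, sparse data.
itemsets = fpgrowth(df, min_support=0.05, use_colnames=True)

# Itemsets with two or more products are the "bought together" patterns.
pairs = itemsets[itemsets["itemsets"].apply(len) >= 2]
print(pairs.sort_values("support", ascending=False).head())
```

At 500,000 x 8,000 the matrix is presumably very sparse, so it is worth storing it with a pandas sparse dtype and tuning `min_support` upward until the result set is manageable; the values above are purely illustrative.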
One-hot encoding variables often does more harm than good. Either use a well-chosen distance for such data (this can be as simple as Hamming or Jaccard on some data sets) with a suitable clustering algorithm (e.g., hierarchical clustering or DBSCAN, but not k-means), or try k-modes; see the sketch below. But most likely, frequent itemsets are the more meaningful analysis on such data.
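If you do want to cluster, here is a sketch of DBSCAN with the Jaccard distance via scikit-learn (again with a synthetic stand-in matrix; `eps` and `min_samples` are illustrative and need tuning on your data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Stand-in for the real data: a boolean customer x product matrix.
rng = np.random.default_rng(0)
X = rng.random((2000, 100)) < 0.05

# Jaccard distance ignores shared zeros, which matters when most
# customers have not bought most products. Running DBSCAN on all
# 500,000 rows is expensive, so tune eps/min_samples on a subsample.
labels = DBSCAN(eps=0.7, min_samples=10, metric="jaccard").fit_predict(X)
print(np.unique(labels, return_counts=True))  # -1 is the noise cluster
```

For the k-modes route, the `kmodes` package (`pip install kmodes`) provides a `KModes` estimator with a scikit-learn-style `fit_predict`, using a matching dissimilarity instead of Euclidean distance.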