r csv hierarchical-clustering euclidean-distance conceptual

R: Clustering customers based on similar product interests for an event

I have a dataset with a list of customers and their product preferences. Basically, it is a simple CSV with a column called "CUSTOMER" and 5 other columns called "PRODUCT_WANTED_A", "PRODUCT_WANTED_B" and so on.

I asked these customers whether they were interested in learning more about a particular product; the answers were simply YES or NO (1 or 0 in the dataset). The dataset can be downloaded here. Naturally, customers have many different interests, depending on the mix of YES and NO in these 5 columns.

My goal is to understand which customers have similar interests. This will help me plan an agenda of product presentations, and for each meeting I would like to find the best grouping of customers. I started with a hierarchical clustering plot like this:

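customer_list <- read.csv("customers_products_wanted.csv", sep = ",", header = TRUE)
# Cluster on the five 0/1 product columns only; CUSTOMER is an identifier, not a feature
customer.hclust <- hclust(dist(customer_list[, names(customer_list) != "CUSTOMER"]))
plot(customer.hclust, labels = customer_list$CUSTOMER)
# rect.hclust() ships with the stats package, so no library() call is needed
rect.hclust(customer.hclust, k = 5)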

This is the plot I got, asking for 5 clusters:

[Image: dendrogram with 5 clusters outlined]

I tried the same, but with 10 clusters:

[Image: dendrogram with 10 clusters outlined]

Question 1: I know it's always hard to tell, but looking at the charts and dataset, what would be your 'cut' to group customers? 5? 10?

I was reviewing the results, and in the same group I had CUSTOMER112, with 1,0,1,0,1 as their preferences, together with CUSTOMER110 (1,1,1,1,1), CUSTOMER106 (1,1,1,1,0) and so on. The "distance" may be correct, but within a given group I have customers with some relevant differences in their preferences.
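To make this concrete, here is a small sketch (not part of my original script) that computes the pairwise distances between these three customers. It assumes the five values are the answers for products A to E, in that order. It also shows method = "binary" from stats::dist next to the default Euclidean distance, purely for comparison, not as a recommendation.

```r
# Preference vectors quoted above (assumed order: products A to E)
prefs <- rbind(
  CUSTOMER112 = c(1, 0, 1, 0, 1),
  CUSTOMER110 = c(1, 1, 1, 1, 1),
  CUSTOMER106 = c(1, 1, 1, 1, 0)
)
colnames(prefs) <- paste0("PRODUCT_WANTED_", LETTERS[1:5])

# On 0/1 data, Euclidean distance is the square root of the number of
# answers on which two customers disagree
d_euclid <- as.matrix(dist(prefs, method = "euclidean"))

# "binary" in stats::dist: among products wanted by at least one of the
# two customers, the share wanted by only one of them
d_binary <- as.matrix(dist(prefs, method = "binary"))

print(round(d_euclid, 3))
print(round(d_binary, 3))
```

With only five 0/1 columns, two customers can differ on two or three products and still be less than 2 units apart, which is why they can land in the same branch of the dendrogram.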

Question 2: I don't know if it's a case of total ignorance about clustering, the code I used or even the dataset. Based on your experience, what would be your approach for the best clustering in this case?

Any comments will be highly appreciated. As you can see, I have made some effort, but I am still in doubt.

Thanks a lot!

Ricardo

Solution

All the answers were helpful, but thanks to @Ben's video recommendation and @Samuel Tan's advice on breaking the customers into grids, I found a good way to handle it.

The video gave me a lot of insight into "noisy" variables in hierarchical clustering, and the grid recommendation helped me think about what the data is really trying to tell me.

With that in mind, a basic data cleaning step removed all customers with no interest in any product (this is obvious, but I didn't pay attention to it at first). Then I set aside customers interested in a single product, because they wouldn't need to attend the workshop series I'm planning (they just want to hear about one product).
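A minimal sketch of that cleaning step follows. The toy data frame and the names N_WANTED and workshop_candidates are made up for illustration; only the CUSTOMER and PRODUCT_WANTED_* column layout comes from the real file.

```r
# Hypothetical toy data with the same column layout as the real CSV
customer_list <- data.frame(
  CUSTOMER = paste0("CUSTOMER", 1:6),
  PRODUCT_WANTED_A = c(0, 1, 1, 1, 1, 0),
  PRODUCT_WANTED_B = c(0, 0, 1, 1, 1, 0),
  PRODUCT_WANTED_C = c(0, 0, 0, 1, 1, 1),
  PRODUCT_WANTED_D = c(0, 0, 0, 1, 1, 0),
  PRODUCT_WANTED_E = c(0, 0, 0, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Count how many products each customer said YES (1) to
product_cols <- grep("^PRODUCT_WANTED_", names(customer_list), value = TRUE)
customer_list$N_WANTED <- rowSums(customer_list[, product_cols])

# Keep only customers interested in more than one product, which drops
# both the "no interest" and the "single product" customers
workshop_candidates <- customer_list[customer_list$N_WANTED > 1, ]
print(workshop_candidates[, c("CUSTOMER", "N_WANTED")])
```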

Evaluating all the others, interested in more than one product, I realized the product mix could point me to a better classification. From there, I grouped customers into 3 clusters: integration opportunities (2 or 3 products), convergence opportunities (4 products) and transformation opportunities (all products).
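The three groups can be derived directly from the number of products each customer wants. This is a sketch under the assumption that the count is already available per customer; the vector n_wanted and its values are hypothetical.

```r
# Hypothetical counts of products wanted, one per remaining customer
n_wanted <- c(CUSTOMER3 = 2, CUSTOMER7 = 3, CUSTOMER4 = 4, CUSTOMER5 = 5)

# Intervals are right-closed: (1,3] -> 2 or 3 products, (3,4] -> 4, (4,5] -> 5
segment <- cut(
  n_wanted,
  breaks = c(1, 3, 4, 5),
  labels = c("integration", "convergence", "transformation")
)

segments <- data.frame(
  CUSTOMER = names(n_wanted),
  N_WANTED = unname(n_wanted),
  SEGMENT = segment
)
print(segments)
print(table(segments$SEGMENT))
```

Because the grouping depends only on the count, no distance measure or dendrogram cut is involved at this stage.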

Now it's clear to me which customers I should focus on for my workshops, and I can plan my post-workshop sales campaigns using materials that target each customer group (integration, convergence, transformation).

Thanks for all the advice!

Ricardo