Hi, I have a dataset that can be simulated with:
set.seed(123)
v1 <- rbinom(10000, 1, .2)
v2 <- rbinom(10000, 1, .3)
v3 <- rbinom(10000, 1, .25)
v4 <- rbinom(10000, 1, .5)
v5 <- rbinom(10000, 1, .35)
v6 <- rbinom(10000, 1, .2)
v7 <- rbinom(10000, 1, .3)
v8 <- rbinom(10000, 1, .25)
v9 <- rbinom(10000, 1, .5)
v10 <- rbinom(10000, 1, .35)
dats <- data.frame(v1,v2,v3,v4,v5,v6,v7,v8,v9,v10)
I am using the Jaccard distance to create a distance structure as follows:
dat.jac <- philentropy::distance(dats, method = "jaccard")
So here is my question: since these are binary variables, there are at most 2^10 = 1024 unique row patterns. Does this mean my data is over-represented, given that I have well over 1024 observations? Put another way: do I need to calculate the Jaccard distance between the unique observations and use the observation counts as weights, or can I just calculate the Jaccard distance between all observations (rows) to get the distance matrix? In terms of programming, which of the following should I proceed with?
dat.jac <- philentropy::distance(dats, method = "jaccard")
or
dat.jac <- philentropy::distance(unique(dats), method = "jaccard")
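(For reference, a quick base R check of how many of the 1024 possible patterns actually occur in the data:)
length(unique(do.call(paste, dats)))   # distinct 0/1 patterns present, at most 2^10 = 1024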
My goal is to use the distance matrix for hierarchical clustering with the following code:
library(factoextra)                        # provides fviz_nbclust() and hcut()
dist.jac.mat <- as.matrix(dat.jac)         # the distance object from above
dist.jac.mat[is.na(dist.jac.mat)] <- 0     # replace NaN/NA distances with 0
hc <- hclust(as.dist(dist.jac.mat), method = "single")
fviz_nbclust(dats, FUN = hcut, diss = as.dist(dist.jac.mat), k.max = 15,
             nboot = 250, method = "silhouette")
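(The is.na() replacement is there because, if I understand the Jaccard formula correctly, two all-zero rows give a 0/0 distance, which comes out as NaN. A quick check shows such rows exist:)
sum(rowSums(dats) == 0)   # all-zero rows, whose pairwise Jaccard distance is 0/0 = NaN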
For single linkage it is indeed sufficient to use only the unique points, because multiplicity does not matter for single and complete linkage. For other linkages this does not generally hold; there you would need to use weighted clustering, weighting each unique point by its number of duplicates.
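A minimal sketch of what that could look like in base R, assuming you deduplicate first and that passing the multiplicities through stats::hclust's members argument is an acceptable way to carry the weights (members feeds cluster sizes into the Lance-Williams update, so it only matters for size-sensitive linkages such as average or Ward):
dats.u <- unique(dats)                       # distinct 0/1 patterns, in order of first appearance
key    <- do.call(paste, dats)               # one string key per original row
counts <- as.vector(table(factor(key, levels = do.call(paste, dats.u))))  # multiplicity per pattern
d.u    <- philentropy::distance(dats.u, method = "jaccard")
d.u[is.na(d.u)] <- 0                         # the all-zero row gives 0/0 = NaN against itself
hc.w   <- hclust(as.dist(d.u), method = "average", members = counts)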
However, you will have a different problem:
Since almost every possible combination occurs in your data, the clustering will be next to useless. There are only a few possible distance values, so your dendrogram will likely have only a few levels before everything is connected. This is inherent to this kind of data: clustering works best on continuous variables, where duplicate distances are rare.
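You can see this directly on your simulated data: with 10 binary variables the Jaccard distance is always a fraction k/u with u <= 10, so at most 33 distinct values can occur. Something along these lines (using the dist.jac.mat from your code) should confirm it:
sort(unique(round(as.vector(dist.jac.mat), 6)))    # the few distinct distance values
length(unique(round(as.vector(dist.jac.mat), 6)))  # at most 33 with 10 binary variables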