Rookie here looking for help.
I'm doing some clustering on a dataset using affinity propagation (the apcluster package). I have a few problems I want to address:
1: Visualization. Apparently the plot() function doesn't like my table when there are more than 15 features. Is there any way to get around this?
2: Related to 1, I thought I could reduce the number of features either with PCA or by checking correlations. The former would give me 2-3 key components to work with, and the latter should let me eliminate the redundant features. However, PCA isn't doing great: PC1 and PC2 together only account for about 22% of the variance...
So is there a way to build a loop in which I drop one feature from the table, run apcluster on the remaining features, and iterate this process over all features? I could then compare the resulting clusterings to see which features are redundant and which are key players.
Obviously, this requires knowing what counts as a good result, and this part of question 2 is where I have no idea how to proceed, in terms of both coding and clustering knowledge; a rough sketch of what I imagine follows my code below. I would really appreciate some guidance.
Below are a minimal mock dataset and my code for apcluster:
#minimum dataset
Id <- 1:30
timestamp <- rep(c("20200512", "20180212", "20181204"), 10)
f_1 <- runif(30, 0.0, 20)
f_2 <- runif(30, 0.0, 500)
f_3 <- runif(30, 0.0, 15)
f_4 <- runif(30, 0.0, 8.6)
f_5 <- runif(30, 0.0, 200)
f_6 <- runif(30, 0.0, 250)
f_7 <- runif(30, 0.0, 2000)
f_8 <- runif(30, 0.0, 35)
f_9 <- runif(30, 0.0, 20)
f_10 <- runif(30, 0.0, 14)
f_11 <- runif(30, 0.0, 10)
f_12 <- runif(30, 0.0, 89)
df <- data.frame(Id, timestamp, f_1, f_2, f_3, f_4, f_5, f_6,
                 f_7, f_8, f_9, f_10, f_11, f_12)
#set the labels aside and drop them from the data frame
sampleID <- df$Id
Time <- df$timestamp
df$Id <- NULL
df$timestamp <- NULL
#scale the numerical features
scaled_df <- scale(df)
#APcluster
library(apcluster)
apres <- apcluster(negDistMat(r=2), scaled_df, details=TRUE)
show(apres)
plot(apres, df)
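To make question 2 concrete, this is roughly the loop I imagine, as a sketch only; comparing each leave-one-feature-out clustering against the all-features clustering with the adjusted Rand index (mclust) is just my guess at what "comparing the results" could mean:
#rough sketch of the leave-one-feature-out idea from question 2;
#the adjusted Rand index (mclust) is only my guess at how to "compare results"
library(mclust)
full_res <- apcluster(negDistMat(r=2), scaled_df)
full_lab <- labels(full_res, type="enum")   #cluster index per observation

feature_effect <- sapply(colnames(scaled_df), function(f) {
  reduced <- scaled_df[, setdiff(colnames(scaled_df), f)]
  res <- apcluster(negDistMat(r=2), reduced)
  adjustedRandIndex(labels(res, type="enum"), full_lab)
})
#low agreement after dropping a feature would suggest that feature matters
sort(feature_effect)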
I think you shouldn't be preoccupied with choosing the variables for clustering; go ahead with the visualization and make sense of the clustering first.
If there is indeed signal, meaning some indication of separation in your dataset, most clustering algorithms will pick up on the informative columns. For example, below I use the iris dataset but add 10 more nonsense columns:
library(apcluster)
set.seed(100)
df = iris[,1:4]
# add 10 columns of pure noise (negative binomial draws)
df = cbind(df, matrix(rnbinom(nrow(df)*10, mu=10, size=1), ncol=10))
scaled_df = scale(df)
pca = prcomp(scaled_df)
# colour by the true species to check that the separation survives the noise
plot(pca$x[,1:2], col=iris$Species)
Above you can see that if you project onto the first two PCs, the separation that was present in the original data is still visible. Looking at the variance explained, the first two PCs account for about 30%, which makes sense because the other columns are simply noise:
head(100*(pca$sdev^2)/sum(pca$sdev^2))
[1] 21.624484 9.839297 9.657884 8.239994 7.875396 7.289238
Now, if we do the clustering, we can also extract the cluster assignment ourselves and not be restricted to the plot function provided by the package:
apres <- apcluster(negDistMat(r=2), scaled_df, q=0.01)
clusters = apres@clusters
# one row per observation: the cluster it was assigned to
clusterid = data.frame(
  cluster = rep(1:length(clusters), sapply(clusters, length)),
  obs = unlist(clusters)
)
clusterid = clusterid[order(clusterid$obs),]
head(clusterid)
  cluster obs
1       2   1
2       2   2
3       2   3
4       2   4
5       2   5
6       2   6
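(As an aside, if I remember the apcluster API correctly, labels() with type="enum" should give the same per-observation cluster index directly, but the construction above makes the mapping explicit.)
# should be equivalent, assuming labels() works the way I remember:
clusterid2 = labels(apres, type="enum")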
Now we have a data.frame that tells us, for every observation we provided as input, the cluster it was assigned to. Let's project this onto the PCA to see how the clusters separate:
library(RColorBrewer)
# note: the "Set3" palette has at most 12 colours, so this assumes <= 12 clusters
COLS = brewer.pal(length(unique(clusterid$cluster)), "Set3")
plot(pca$x[,1:2], col=COLS[clusterid$cluster], pch=20, cex=0.5)
We can look at the variables that contain the signal (the first two iris measurements):
plot(df[,1:2],col=COLS[clusterid$cluster],pch=20,cex=0.5)
And at those that don't (two of the noise columns):
plot(df[,9:10],col=COLS[clusterid$cluster],pch=20,cex=0.5)
So if there is indeed a sensible way of clustering your data, which your PCA suggests, then using the approach above you can already explore your data and find out whether the clustering makes sense. Tune your clustering parameters and you can easily visualize the end result.
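For instance (just a sketch, and the q values here are arbitrary), you could sweep the q parameter, which sets the input-preference quantile, and see how the number of clusters changes before settling on a value to visualize:
# sketch: sweep the quantile parameter q (values are arbitrary)
# and record how many clusters each setting produces
qs = c(0.01, 0.05, 0.1, 0.25, 0.5)
n_clusters = sapply(qs, function(q) {
  length(apcluster(negDistMat(r=2), scaled_df, q=q)@clusters)
})
data.frame(q = qs, clusters = n_clusters)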