Tags: r, cluster-analysis

Clustering with non-independent variables and a very large data set


I have a very large data set (~400,000 instances) that looks like the data below.

# Simulate 10 classes: counts of students in each of five grade bands
samp <- function() {
  sample(0:9, 5, replace = TRUE,
         prob = c(0.5, 0.1, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05))
}
data <- as.data.frame(t(replicate(10, samp())))
colnames(data) <- c("fail", "below_a", "aver", "above_a", "exceed")
data$class_size <- rowSums(data[1:5])   # total students per class
data$class_prof <- sample(letters[1:6], nrow(data), replace = TRUE)

I am trying to cluster this set, but there are the following problems:

  • class size is the sum of the first five columns; I think it may cause a collinearity issue, but it is an important variable.
  • the first five variables are not independent: they are the results of measuring the same quality, and everyone in the class must fall into one of the categories.
  • the set is really big; the only algorithm that did not have convergence issues was k-means (without using the class profile variable).

I can drop the categorical variable, as it can be included in the models at a later stage, but I am keen to try some methods that use it as well and compare the results.

For the convergence problems, I tried downsampling, but for many methods I need to downsample to 5,000-7,000 rows to avoid memory issues, which is less than 2% of the original data.

What methods could be applied here using R packages?


Solution

  • Try doing a principal component analysis on the data, then run k-means (or kNN) on the number of dimensions you decide to keep.

    There are a couple of different packages that are fairly straightforward to use for this; you'll have to mean-center and scale your data first. You'll also have to convert any factors to numeric using a one-hot encoding (one 0/1 column for every level of the original factor column).

    Look into 'prcomp' or 'princomp'
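A minimal sketch of that pipeline, assuming the simulated data from the question (the cluster count of 4 and the 90% variance cutoff below are arbitrary choices for illustration, not recommendations):

```r
set.seed(42)

# Stand-in for the questioner's data: same structure, 1,000 rows
samp <- function() sample(0:9, 5, replace = TRUE,
                          prob = c(0.5, 0.1, rep(0.05, 8)))
data <- as.data.frame(t(replicate(1000, samp())))
colnames(data) <- c("fail", "below_a", "aver", "above_a", "exceed")
data$class_size <- rowSums(data[1:5])
data$class_prof <- factor(sample(letters[1:6], nrow(data), replace = TRUE))

# One-hot encode the factor: one 0/1 column per level (drop the intercept)
onehot <- model.matrix(~ class_prof - 1, data = data)

# Combine numeric columns with the one-hot columns, then mean-center and scale.
# class_size is omitted: it is a linear combination of the first five columns.
num <- cbind(data[, c("fail", "below_a", "aver", "above_a", "exceed")], onehot)
X   <- scale(num)

# PCA on the scaled matrix (already centered/scaled above)
pca <- prcomp(X, center = FALSE, scale. = FALSE)

# Keep enough components to cover ~90% of the variance
cum <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k   <- which(cum >= 0.9)[1]

# k-means on the reduced scores
km <- kmeans(pca$x[, 1:k, drop = FALSE], centers = 4, nstart = 10)
table(km$cluster)
```

Because the one-hot columns are 0/1 rather than continuous, scaling them alongside the counts is a pragmatic simplification; for a more principled treatment of mixed data you could instead compute a Gower distance and cluster on that.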