I have a very large data set (~400,000 instances) that looks like the data below.
# toy example with 10 rows; the real data has ~400,000
data <- as.data.frame(matrix(0, 10, 5))
samp <- function() {
  # draw 5 counts from 0-9, heavily weighted towards 0
  sample(0:9, 5, replace = TRUE,
         prob = c(0.5, 0.1, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05))
}
data <- lapply(split(data, 1:10), function(x) samp())
data <- do.call(rbind.data.frame, data)
colnames(data) <- c("fail", "below_a", "aver", "above_a", "exceed")
data$class_size <- apply(data[1:5], 1, FUN = sum)                      # numeric total
data$class_prof <- sample(letters[1:6], nrow(data), replace = TRUE)    # categorical variable
I am trying to cluster this set, but I am running into the following problems:
The data contains a categorical variable (class_prof). I can drop it, since it can be included in the models at a later stage, but I am keen to try some methods that can use it as well and compare the results.
For the convergence problems, I tried downsampling, but for many methods I need to downsample to 5,000-7,000 rows to avoid memory issues, which is less than 2% of the original data.
What methods could be applied here using R packages?
Try doing a principal component analysis (PCA) on the data, then k-means (or k-NN) on however many of the resulting dimensions you decide to keep.
There are a couple of different packages that are fairly straightforward to use for this; you'll have to mean-center and scale your data beforehand. You'll also have to convert any factors to numeric using a one-hot method (one column for every possible level of the original factor column).
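As a minimal sketch of that preprocessing step, assuming the example data frame from the question (model.matrix handles the one-hot expansion; the name X and the zero-variance filter are just illustrative choices):

data$class_prof <- factor(data$class_prof)
# with the intercept dropped, class_prof is expanded into one 0/1 column per level
X <- model.matrix(~ . - 1, data = data)
# drop any constant columns so scale() doesn't produce NaNs, then mean-centre and scale
X <- X[, apply(X, 2, sd) > 0, drop = FALSE]
X <- scale(X)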
Look into 'prcomp' or 'princomp'.
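And a minimal sketch of the PCA-then-kmeans step on the scaled matrix X from the snippet above; the choices of 3 components and 4 clusters are placeholders you would tune, e.g. from the variance explained and an elbow or silhouette plot:

pca <- prcomp(X)            # X is already centred and scaled above
summary(pca)                # cumulative proportion of variance per component
scores <- pca$x[, 1:3]      # keep the first few components, adjust after inspecting summary(pca)
set.seed(1)
km <- kmeans(scores, centers = 4, nstart = 25)
table(km$cluster)           # cluster sizes
data$cluster <- km$cluster  # attach the labels back to the original rows

Running kmeans on a handful of component scores is cheap even for ~400,000 rows, so you shouldn't need to downsample nearly as aggressively as for methods that build a full distance matrix.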