Search code examples
rstatisticsk-meansbin-packing

Packing customers into buckets


I have say 25 customers. Each customer has a number of users of our system, e.g. customer 1 has 45 users, customer 2 has 46 users... customer 25 has 1000 users.

I want to bin each customer into a bucket, where each bucket contains a roughly equal number of users. I know that I want 5 buckets in total.

(The buckets here represent servers, I want to apportion my clients to different servers where the total number of users per server is roughly equal, so as to prevent overloading the servers. 1 client has to be on the same server (i.e. can't split 1 client over 2 servers).

Any idea of suitable methods for apportioning customers to buckets? I thought some clustering methods might work (I tried kmeans using R), but I cant seem to find ways of stipulating that the total number of users in each cluster is roughly the same.

Here's my R code as an example of what I've done so far:

#Create dataset
r <- data.frame(users=c(1000, 960, 920, 870, 850, 700, 600, 550, 520, 500, 420, 400, 390, 300, 210, 200, 160, 80, 70, 50, 49, 48, 47, 46, 45))
#Try kmeans clustering
fit <- kmeans(r, 5) 
#get cluster means
aggregate(r, by=list(fit$cluster),FUN = mean)
#append cluster assignment
r <- data.frame(r,fit$cluster)

#Plot cluster
library(cluster)
clusplot(r, fit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
library(fpc)
plotcluster(r, fit$cluster)

This clusters my customers into buckets, but the number of users in each bucket is not roughly equal.

I've tagged this as an R problem, but if there's a simple solution in some other package I'm all ears :-)


Solution

  • I don't know what the recommended solution for such 'constant sum sampling ' is. Here's my shot at it -- sort the items, convert to matrix where each column represents a sample, reverse every other row.

    Here's the code:

    set.seed(1024)
    r <- data.frame(users=c(1000, 960, 920, 870, 850, 700, 600, 550, 520, 500, 420, 400, 390, 300, 210, 200, 160, 80, 70, 50, 49, 48, 47, 46, 45))
    
    a<-   r$users #runif(n = 25, 100,400) #rnorm(25,100,100) # 1:25
    #hist(a)
    df<- data.frame(id=1:25,x=a)
    
    # sort 
    x<- df$id[order(df$x)]
    # convert to matrix
    #each column of this matrix represetns one sample
    xm<-matrix(x,ncol=5,byrow = T); xm
    oldsum<-apply(matrix(df$x,ncol=5,byrow = T), 2,sum)
    
    #flip alternate rows of this sorted matrix
    i= 1:nrow(xm)
    im=i[c(F,T)]
    xm[im,]
    xm[im,]<- rev(xm[im,])
    
    # new matrix of indeices 
    xm
    
    #hence the new matrix of values
    xm2<- matrix(a[c(xm)],ncol = 5, byrow = F)
    xm
    xm2
    
    newsum<- (apply(xm2, 2,sum))
    
    # improvement
    rbind(oldsum,newsum)
    barplot(rbind(oldsum,newsum)[1,])
    barplot(rbind(oldsum,newsum)[2,])
    
    # each column of following matrix represents one sample 
    #(values are indices in original vector a)
    xm