Tags: r, loops, for-loop, binning

Binning data by row values with minimum sample size


I’m trying to figure out how to create bins with a minimum sample size that also accounts for values in a specific column.

So, in the dummy data below, I want to create bins that have a minimum of 6 samples in them, but if a bin includes a row with a specific value from column a, I want that bin to also include all other rows with that same value. I also do not want any bin to contain only 1 unique value from column a. I then want the output to have one row per bin, with the mean of the unique values in column a, the mean of all values in column b, and a column with the sample size.

 df<-data.frame(a=c(1,1,2,2,2,3,3,3,3,4,4,5,6,6,6,7,7,7,7,7,7,8,8,8,9,9,9,9,10,10,10),
           b=c(12,13,11,12,12,11,15,13,12,11,14,15,11,14,12,11,14,12,13,15,11,11,12,13,14,16,14,13,15,13,15))

I want the output to look something like this:

  mean.a   mean.b n
1    2.0 12.33333 9
2    5.0 12.83333 6
3    7.0 12.66667 6
4    8.5 13.28571 7

This is what I have so far:

x<-df
final<-NULL

for(i in 1:16){
  x1<-x[1:6,]
  x2<-x[-c(1:6),]
  x3<-rbind(x1, x2[x2$a==x1$a[6],])
  n<-nrow(x3)
  y<-mean(x3$b)
  z<-mean(unique(x3$a))
  f<-data.frame(mean.a=z, mean.b=y, n=n)
  final<-rbind(final,f)
  x<-x[-c(1:n),]
}
final<-final[complete.cases(final),]

The problem I'm having is that I can't figure out how to avoid a bin with only one unique value in column a. For example, in the third bin, all 6 rows have a = 7, but I would like to add the next sequential row, and all rows with that row's value in column a, to that bin (which would be all rows that have a = 8 in this case).

Also, I can't figure out how to get the loop to keep going without hard-coding 1:number at the top and then deleting the rows with NAs afterwards. This isn't a huge deal, but it's the reason the code is kind of messy.

I'm not attached to this loop by any means, and if there's a simpler way to answer this question, I'm all for it!


Solution

  • Here is a recursive solution to the problem, where get_6 returns a grouping variable based on column a. The conditions are enforced by the inner function get_i: starting from index 6, it moves forward until the next value differs from the current one and the bin holds more than one unique value (or until the vector ends). Every time a sequence satisfying the conditions is found, the id is increased by one and get_6 recurses on the remaining values, so the result is similar to what you get from the rleid function in data.table. Trailing rows that cannot form a bin of at least 6 are left out. From there, summary statistics can be calculated based on this grouping variable:

    get_6 <- function(vec, id = 1) {
        if(length(vec) < 6) NULL
        else {
            get_i <- function(x, i = 6) {
                if(length(x) == i) i
                else if(x[i + 1] != x[i] && length(unique(x[1:i])) != 1) i
                else get_i(x, i + 1)
            }
            ind <- get_i(vec)
            c(rep(id, ind), get_6(vec[-(1:ind)], id + 1))
        }
    }
    
    s <- get_6(df$a)
    s
    # [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4
    
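    library(dplyr)
    # s can be shorter than nrow(df): a trailing remainder of fewer than
    # 6 rows gets no group, so keep only the rows that were assigned one
    df[seq_along(s), ] %>%
        mutate(g = s) %>%
        group_by(g) %>%
        summarize(n = n(), mean.a = mean(unique(a)), mean.b = mean(b))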
    
    #Source: local data frame [4 x 4]
    
    #      g     n mean.a   mean.b
    #  <dbl> <int>  <dbl>    <dbl>
    #1     1     9    2.0 12.33333
    #2     2     6    5.0 12.83333
    #3     3     9    7.5 12.44444
    #4     4     7    9.5 14.28571
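
Since the question also asks how to loop without hard-coding the number of iterations, here is a minimal iterative sketch of the same grouping rule, using a while loop and base R only. The names bin_ids and min_n are hypothetical (they are not from the answer above), and the sketch assumes that column a is already sorted so that equal values are contiguous and that it contains no NAs. Unlike get_6, it marks a leftover tail of fewer than min_n rows as NA rather than dropping it.

```r
# Hypothetical helper: assign a bin id to each element of `a`.
# Assumes `a` is sorted (equal values are contiguous) and has no NAs.
bin_ids <- function(a, min_n = 6) {
  n <- length(a)
  ids <- rep(NA_integer_, n)   # rows left as NA did not fit into any bin
  start <- 1L
  id <- 1L
  # keep going while enough rows remain to form a bin of min_n rows
  while (n - start + 1L >= min_n) {
    end <- start + min_n - 1L
    # extend the bin while the next row repeats the current value,
    # or while the bin still holds only one unique value
    while (end < n &&
           (a[end + 1L] == a[end] || length(unique(a[start:end])) == 1L)) {
      end <- end + 1L
    }
    ids[start:end] <- id
    id <- id + 1L
    start <- end + 1L
  }
  ids
}

df <- data.frame(
  a = c(1,1,2,2,2,3,3,3,3,4,4,5,6,6,6,7,7,7,7,7,7,8,8,8,9,9,9,9,10,10,10),
  b = c(12,13,11,12,12,11,15,13,12,11,14,15,11,14,12,11,14,12,13,15,11,
        11,12,13,14,16,14,13,15,13,15)
)

ids <- bin_ids(df$a)

# Summarise each bin with base R; split() drops rows whose id is NA
binned <- do.call(rbind, lapply(split(df, ids), function(d) {
  data.frame(mean.a = mean(unique(d$a)), mean.b = mean(d$b), n = nrow(d))
}))
binned
```

On the dummy data this should produce the same four bins as the recursive version. As with get_i, the last bin can still end up with a single unique value if the data run out before a second value appears.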