R: create a new categorical variable from a categorical variable based on a continuous variable


I already had a look here, where the cut function is used. However, I haven't been able to come up with a clever solution given my situation.

First some example data that I currently have:

df <- data.frame(
  Category = LETTERS[1:20], 
  Nber_within_category = c(rep(1,8), rep(2,3), rep(6,2), rep(10,3), 30, 50, 77, 90)
)

I would like to add a third column that forms a new category based on the Nber_within_category column. In this example, how can I create e.g. Category_new such that in each new category the Nber_within_category is at least 5, with the constraint that if a Category already has Nber_within_category >= 5, the original category is kept?

So for example, it should look like this:

df <- data.frame(
  Category = LETTERS[1:20], 
  Nber_within_category = c(rep(1,8), rep(2,3), rep(6,2), rep(10,3), 30, 50, 77, 90),
  Category_new = c(rep('a',5), rep('b', 4), rep('c',2), LETTERS[12:20])
)

Solution

  • It's a bit of a hack, but it works:

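    library(dplyr)

    df %>% 
      # bin index of the running total, in bins of width 5
      mutate(tmp = floor((cumsum(Nber_within_category) - 1)/5)) %>% 
      # keep the original name for large categories, otherwise use the bin's letter
      # (as.character guards against Category being a factor)
      mutate(new_category = ifelse(Nber_within_category >= 5,
                                   as.character(Category),
                                   letters[tmp+1]))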
    

    The expression floor((cumsum(Nber_within_category) - 1)/5) sorts the running total into bins of width 5 (the -1 keeps a row whose running total is exactly 5 in the current bin instead of starting the next one). I then use that bin number as an index into letters to name the new categories for the rows where Nber_within_category < 5.

    It might be easier to understand how the column tmp is defined if you run:

    x <- 1:100
    data.frame(x, y = floor((x- 1)/5))
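
To see what the binning does on the question's own data, here is a minimal base R sketch of the same idea without dplyr. The names running_total, bin, small and pooled_totals are made up for illustration and are not part of the original answer; the output column is called Category_new as in the question.

```r
# Example data from the question
df <- data.frame(
  Category = LETTERS[1:20],
  Nber_within_category = c(rep(1, 8), rep(2, 3), rep(6, 2), rep(10, 3), 30, 50, 77, 90),
  stringsAsFactors = FALSE
)

# Running total of the counts and the width-5 bin each row falls into
running_total <- cumsum(df$Nber_within_category)
bin <- floor((running_total - 1) / 5)

# Rows with a count below 5 take their bin's letter; the others keep their name
df$Category_new <- ifelse(df$Nber_within_category >= 5,
                          df$Category,
                          letters[bin + 1])

# Total count per pooled category (only the rows that were regrouped)
small <- df$Nber_within_category < 5
pooled_totals <- tapply(df$Nber_within_category[small],
                        df$Category_new[small],
                        sum)
pooled_totals
# a = 5, b = 5, c = 4
```

On this data the result matches the Category_new column requested in the question. The last pooled group, c, only adds up to 4 (2 + 2), which is also what the expected output in the question shows.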