Search code examples
rdata-bindingcut

How to cut a number in different bin and expand the data frame with the new bins?


I would like to compute something really simple, but I don't find the solution. I want to cut in bins certain numbers, but I want to save the bins.

bin.size = 100 
df = data.frame(x =c(300,400), 
                y = c("sca1","sca2"))
cut(df$x, seq(0, 400, bin.size), 
    include.lowest = TRUE)

Gives me

[1] (200,300] (300,400]
Levels: [0,100] (100,200] (200,300] (300,400]

But what I want something like this:

        bin    y
1   (0,100] sca1
2 (100,200] sca1
3 (200,300] sca1
4   (0,100] sca2
5 (100,200] sca2
6 (200,300] sca2
7 (300,400] sca2

I want to do this because I want to calculate the number of values that enter in bins of 100. For example:

df2 = data.frame(snp = c(1,2,10,100,1,2,14,16,399), 
                 sca = c("sca1","sca1","sca1","sca1","sca2","sca2","sca2","sca2","sca2"))
df2
  snp  sca
1   1 sca1
2   2 sca1
3  10 sca1
4 100 sca1
5   1 sca2
6   2 sca2
7  14 sca2
8  16 sca2
9 399 sca2

snp could be the the position in a vector sca1.

The end goal is to obtain something like this:

        bin    y num
1   (0,100] sca1   4
2 (100,200] sca1   0
3 (200,300] sca1   0
4   (0,100] sca2   4
5 (100,200] sca2   0
6 (200,300] sca2   0
7 (300,400] sca2   1

The best I can do is this:

df2$cat = cut(df2$snp, seq(0, 400, bin.size), 
include.lowest = TRUE)
df2
  snp  sca       cat
1   1 sca1   [0,100]
2   2 sca1   [0,100]
3  10 sca1   [0,100]
4 100 sca1   [0,100]
5   1 sca2   [0,100]
6   2 sca2   [0,100]
7  14 sca2   [0,100]
8  16 sca2   [0,100]
9 399 sca2 (300,400]

Or this:

table(df2$cat,df2$sca)
            sca1 sca2
  [0,100]      4    4
  (100,200]    0    0
  (200,300]    0    0
  (300,400]    0    1

But the problem with this last attempt is that the category (300,400] doesn't make sense for sca1 because it doesn't exist. It should be NA or not appearing. How to solve this?


Solution

  • Here's one way using a few packages from the tidyverse:

    library(dplyr)
    library(tidyr)
    library(purrr)
    
    df %>%
      left_join(nest(df2, snp, .key = "snp"), by = c("y" = "sca")) %>%
      mutate(
        cuts = map(x, ~ seq(0, ., by = 100)),
        tbls = pmap(
          .l = list(snp, cuts),
          .f = function(xx, breaks) {
            z <- table(cut(xx$snp, breaks))
            data_frame(cut = names(z), count = z)
          }
        )
      ) %>%
      select(y, tbls) %>%
      unnest()
    #      y       cut count
    # 1 sca1   (0,100]     4
    # 2 sca1 (100,200]     0
    # 3 sca1 (200,300]     0
    # 4 sca2   (0,100]     4
    # 5 sca2 (100,200]     0
    # 6 sca2 (200,300]     0
    # 7 sca2 (300,400]     1