Search code examples
rbinning

How to un-bin previously binned length data for a scatter plot


I have pre-binned length data and I am trying to create a scatter/jitter plot with date on the x axis and length on the y axis. Here is some example data:

df<- as.data.frame(cbind(c(20:30), c(0,0,2,3,5,2,0,0,0,0,0), c(0,0,0,2,5,3,2,0,0,0,0), c(0,0,0,0,1,4,4,3,0,0,0)))
colnames(df) <- c('Length','01-01-2023','01-02-2023','01-03-2023')

example format

Unless there is another way to do this, I need to unbin the lengths, which would look like something like this.

library(reshape2)
df2 <- melt(df, id.var="Length")
colnames(df2)[2:3] <- c("Date", "Number")
df2$Date <- as.Date(df2$Date, "%m-%d-%Y")

After converting to long format, I'm stuck on how to go about unbinning each length occurrence while still maintaining the corresponding date in another column. I've tried using tidyr and rep() to create functions but haven't been able to figure anything out.

rep(df2$Length, df2$Number)

I could use rep() to obtain each length integer, but I still need the date each value corresponds with. I'm thinking maybe using a for loop to loop through each row, replicating the length value by the number of occurrences, but I still need to return the date within each loop.

I also tried using tidyr and lapply but couldn't get that to work either.

fun <- function(x){
df2[x,] %>% 
    tidyr::uncount(value) %>% 
    magrittr::use_series("Length")
}

 lapply(unique(test$Date),fun)

I haven't seen many questions about people needing to un-bin already binned data, so any help would be appreciated. Thank you!


Solution

  • I would do it like this:

    ## with uncount
    df |>
      pivot_longer(-Length, names_to = "Date", values_to = "n") |>
      uncount(n)
    # A tibble: 36 × 2
       Length Date      
        <dbl> <chr>     
     1     22 01-01-2023
     2     22 01-01-2023
     3     23 01-01-2023
     4     23 01-01-2023
     5     23 01-01-2023
     6     23 01-02-2023
     7     23 01-02-2023
     8     24 01-01-2023
     9     24 01-01-2023
    10     24 01-01-2023
    # ℹ 26 more rows
    # ℹ Use `print(n = ...)` to see more rows
    
    
    ## or with rep
    result = df |>
      pivot_longer(-Length, names_to = "Date", values_to = "n") |>
      filter(n > 0)
    result = result[rep(1:nrow(result), times = result$n), ]
    result
    # # A tibble: 36 × 3
    #    Length Date           n
    #     <dbl> <chr>      <dbl>
    #  1     22 01-01-2023     2
    #  2     22 01-01-2023     2
    #  3     23 01-01-2023     3
    #  4     23 01-01-2023     3
    #  5     23 01-01-2023     3
    #  6     23 01-02-2023     2
    #  7     23 01-02-2023     2
    #  8     24 01-01-2023     5
    #  9     24 01-01-2023     5
    # 10     24 01-01-2023     5
    # # ℹ 26 more rows
    # # ℹ Use `print(n = ...)` to see more rows
    

    (You can drop the n column, of course. I just left it in to check that it worked.)