Count the number of types in the groups of a data frame using R


I have data like this:

data<-data.frame(is.on=c("FALSE","FALSE","FALSE","TRUE","FALSE","TRUE","FALSE","FALSE","TRUE","TRUE","TRUE","TRUE"),
                 dur=c(10,20,30,10,10,10,10,20,10,20,30,40),
                 dt=c(10,10,10,10,10,10,10,10,10,10,10,10),
                 block=c(2,2,2,3,4,5,6,6,7,7,7,7),
                 interval_block=c(1,1,1,2,2,2,3,3,3,4,4,4))

Now I want to build summary_data based on block. summary_data should have one row per distinct value of interval_block. Step 1:

# Step 1: Find the maximum number of types for block column within each interval_block
max_types <- sapply(unique(data$interval_block), function(interval) {
  blocks <- unique(data[data$interval_block == interval, "block"])
  length(blocks)
})
max_num_types <- max(max_types) 

For interval_block=1, there is one type of block (2).
For interval_block=2, there are three types of block (3, 4 and 5).
For interval_block=3, there are two types of block (6 and 7).
For interval_block=4, there is one type of block (7).

So the maximum number of block types within any interval_block is 3, and the code above calculates that number. Based on this number, I want to create dur_ columns. In this case, there should be dur_1, dur_2 and dur_3.
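The per-interval counts listed above can be double-checked with a short base-R sketch. The names types_per_interval and dur_cols are only illustrative, and the data frame is rebuilt here with just the two columns the check needs:

```r
# Only the two columns needed for this check, copied from the question's data
data <- data.frame(
  block          = c(2, 2, 2, 3, 4, 5, 6, 6, 7, 7, 7, 7),
  interval_block = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4)
)

# Number of distinct block values inside each interval_block
types_per_interval <- tapply(data$block, data$interval_block,
                             function(b) length(unique(b)))
types_per_interval

# One dur_ column per possible block type
dur_cols <- paste0("dur_", seq_len(max(types_per_interval)))
dur_cols
```

This should report 1, 3, 2 and 1 types for interval_block 1 to 4, and therefore three dur_ columns.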

Step 2: Decide the values of the dur_ columns. For interval_block=1, there is one type of block, so I want to fill dur_1 and leave dur_2 and dur_3 as 0. #(block=2 within interval_block=1)=3, so I want to fill dur_1 with 3 times 10=30.

For interval_block=2, there are three types of block, so I want to fill dur_1, dur_2 and dur_3. #(block=3 within interval_block=2)=1, #(block=4 within interval_block=2)=1, #(block=5 within interval_block=2)=1. So I want to fill dur_1 with 1 times 10=10, dur_2 with 1 times 10=10 and dur_3 with 1 times 10=10.

For interval_block=3, there are two types of block, so I want to fill dur_1 and dur_2 and leave dur_3 as 0. #(block=6 within interval_block=3)=2, #(block=7 within interval_block=3)=1. So I want to fill dur_1 with 2 times 10=20, dur_2 with 1 times 10=10 and dur_3 with 0.

For interval_block=4, there is one type of block, so I want to fill dur_1 and leave dur_2 and dur_3 as 0. #(block=7 within interval_block=4)=3. So I want to fill dur_1 with 3 times 10=30, and dur_2 and dur_3 with 0.

I described the rules at length, but basically it is all about counting the rows of each block type within interval_block and multiplying that count by 10. My expected output should look like this:

summary_data<-data.frame(dur_1=c(30,10,20,30),
                         dur_2=c(0,10,10,0),
                         dur_3=c(0,10,0,0),
                         interval_block=c(1,2,3,4))

I don't know how to code this in R.

For clarification:

First row: there are 3 rows with block=2 (one type). Since there is one type, we fill only dur_1, with 3 times 10.

Second row: there are 1 row with block=3, 1 with block=4 and 1 with block=5 (three types). Since there are three types, we fill dur_1, dur_2 and dur_3 with 1 times 10, 1 times 10 and 1 times 10 respectively.

Third row: there are 2 rows with block=6 and 1 with block=7 (two types). Since there are two types, we fill dur_1 and dur_2 with 2 times 10 and 1 times 10 respectively.
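The counting rule described above can be sketched in base R before worrying about the wide layout. This is only an illustration of the rule, not the final summary_data: the name dur_per_interval is made up for the example, and the data frame is rebuilt with just the two relevant columns.

```r
# Only the two columns the rule depends on, copied from the question's data
data <- data.frame(
  block          = c(2, 2, 2, 3, 4, 5, 6, 6, 7, 7, 7, 7),
  interval_block = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4)
)

# For each interval_block: count the rows of every distinct block, times 10
dur_per_interval <- lapply(
  split(data$block, data$interval_block),
  function(b) 10 * as.vector(table(b))
)
dur_per_interval
```

Each list element holds the non-zero dur_ values for one interval_block, in order of block; padding the shorter ones with zeros gives the rows of the expected output.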


Solution

  • Taking advantage of {dplyr} and {tidyr}, you could do the following:

    library(dplyr)
    library(tidyr)
    
    data |>
      group_by(interval_block) |>
      mutate(# position (1, 2, ...) of each block among the blocks of its interval
             dur = block |> as.factor() |> as.integer(),
             dur_names = paste0('dur_', dur),
             # every row contributes 10, so the sum below is 10 * row count
             dur_values = 10
             ) |>
      group_by(interval_block, dur_names) |>
      summarise(dur_values = sum(dur_values), .groups = 'drop_last') |>
      pivot_wider(names_from = dur_names, values_from = dur_values) |>
      # combinations that do not occur come out as NA; replace them with 0
      mutate(across(everything(), ~ ifelse(is.na(.x), 0, .x))) |>
      select(starts_with('dur'), interval_block)
    
    # A tibble: 4 x 4
    # Groups:   interval_block [4]
      dur_1 dur_2 dur_3 interval_block
      <dbl> <dbl> <dbl>          <dbl>
    1    30     0     0              1
    2    10    10    10              2
    3    20    10     0              3
    4    30     0     0              4
    

    Edit: a slightly esoteric alternative with base R:

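    # largest number of distinct blocks found in any interval_block
    max_blocks <- with(data, max(rowSums(table(interval_block, block) > 0)))

    data |>
      split(data$interval_block) |>
      Map(f = \(x) {
        dur <- table(x$block)  # row count per distinct block in this interval
        # zero-padded vector: the first length(dur) slots get 10 * count
        `[<-`(numeric(max_blocks), seq_along(dur), 10 * dur)
      }) |>
      Reduce(f = rbind) |>
      # assumes interval_block already appears in sorted order, as in the example data
      cbind(unique(data$interval_block)) |>
      as.data.frame(row.names = FALSE) |>
      setNames(nm = c(paste0('dur_', seq_len(max_blocks)), 'interval_block'))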
    

    '[<-' for zero-padding taken from here
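For readers unfamiliar with the `[<-` trick, here is a minimal sketch of what it does in isolation. The helper name pad_with_zeros is hypothetical and only used for this illustration:

```r
# Hypothetical helper, named only for this illustration
pad_with_zeros <- function(values, n) {
  # `[<-`(x, i, value) is the functional form of x[i] <- value:
  # start from n zeros and overwrite the leading positions
  `[<-`(numeric(n), seq_along(values), values)
}

pad_with_zeros(c(20, 10), 3)

# The same thing written with ordinary assignment
padded <- numeric(3)
padded[1:2] <- c(20, 10)
padded
```

Both forms give 20 10 0. The functional form is convenient inside a pipeline or an anonymous function because it returns the modified vector without needing a temporary variable.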