Search code examples
rdataframedplyrmergedatatable

Add a new column with sum of count to a dataframe according to informations from another in R


I would need help in order to add count column into a table called tab1 according to another tab2.

Here is the first tab :

tab1

    Event_Groups Other_column
1      1_G1,2_G2            A
2           2_G1            B
3           4_G4            C
4 7_G5,8_G5,9_G5            D

as you can see in Event_Groups column I have 2 information (Event and Groups numbers separated by a "_"). These informations will also be found in tab2$Group and tab2$Event and the idea is for each element within rows in tab1 (separated by a comma) , to count the number of rows within tab2 where VALUE1 < 10 AND VALUE2 > 30 and then add this count into tab1 in a new column called Sum_count.

Here is the tab2

  Group Event VALUE1 VALUE2
1    G1     1      5     50  <- VALUE1 < 10 & VALUE2 > 30 : count 1  
2    G1     2     6      20  <- VALUE2 < 30  : count 0 
3    G2     2     50     50  <- VALUE1 > 10  : count 0
4    G3     3      0      0
5    G4     1      0      0
6    G4     4      2     40  <- VALUE1 < 10 & VALUE2 > 30 : count 1 
7    G5     7      1     70  <- VALUE1 < 10 & VALUE2 > 30 : count 1 
8    G5     8      4     67  <- VALUE1 < 10 & VALUE2 > 30 : count 1 
9    G5     9      3     60  <- VALUE1 < 10 & VALUE2 > 30 : count 1 

Example :

  • For instance for the first element of row1 in tab1: 1_G1 we see in tab2 (row1) that VALUE1 < 10 & VALUE2 > 30, so I count 1.
  • For the seconde element (row1) : 2_G2 we see in tab2 (row3) that VALUE1 > 10, so I count 0.

And here is the expected result tab1 dataframe;

Event_Groups     Other_column Sum_count
1_G1,2_G2        A            1
2_G1             B            0
4_G4             C            1
7_G5,8_G5,9_G5   D            3

I dot not know if I am clear enough, do not hesitate to ask questions.


Here are the two tables in dput format if it can helps:

tab1

structure(list(Event_Groups = structure(1:4, .Label = c("1_G1,2_G2", 
"2_G1", "4_G4", "7_G5,8_G5,9_G5"), class = "factor"), Other_column =
structure(1:4, .Label = c("A",  "B", "C", "D"), class = "factor")),
class = "data.frame", row.names = c(NA, 
-4L))

tab2

structure(list(Group = structure(c(1L, 1L, 2L, 3L, 4L, 4L, 5L, 
5L, 5L), .Label = c("G1", "G2", "G3", "G4", "G5"), class = "factor"), 
    Event = c(1L, 2L, 2L, 3L, 1L, 4L, 7L, 8L, 9L), VALUE1 = c(5L, 
    6L, 50L, 0L, 0L, 2L, 1L, 4L, 3L), VALUE2 = c(50, 20, 50, 
    0, 0, 40, 70, 67, 60)), class = "data.frame", row.names = c(NA, 
-9L))

Solution

  • Here is one way to do it:

    library(dplyr)
    library(tidyr)
    
    tab1 %>% 
      mutate(Event_Groups = as.character(Event_Groups)) %>% 
      separate_rows(Event_Groups, sep = ",") %>% 
      left_join(., 
                tab2 %>% 
                  unite(col = "Event_Groups", Event, Group) %>% 
                  mutate(count = if_else(VALUE1 < 10 & VALUE2 > 30,1L, 0L))) %>%
      group_by(Other_column) %>% 
      summarise(Event_Groups = paste(unique(Event_Groups), collapse = ","),
                Sum_count = sum(count)) %>% 
      select(Event_Groups, everything())
    
    #> Joining, by = "Event_Groups"
    #> `summarise()` ungrouping output (override with `.groups` argument)
    #> # A tibble: 4 x 3
    #>   Event_Groups   Other_column Sum_count
    #>   <chr>          <fct>            <int>
    #> 1 1_G1,2_G2      A                    1
    #> 2 2_G1           B                    0
    #> 3 4_G4           C                    1
    #> 4 7_G5,8_G5,9_G5 D                    3
    

    Created on 2021-07-29 by the reprex package (v0.3.0)