Search code examples
rdplyrfrequency

How to get the frequency( count) of Variable C when Variables A and B are mentioned together?


I have the following dplyr code:

df3 <- Table3%>%
  group_by(Q6,Q9,Q11) %>%
  summarise(count = n()) %>%
  mutate(per = paste0(round(100 *count/sum(count),2),'%')) %>% 
  ungroup()
 

Q6 is a name, Q9 describes a topic that can be used for any value of Q6, and Q11 is a Y/N (1/2) question if there is a goal mentioned.

I am not sure if my code is correct for the interpretation that i need because i am not sure what summarize does, or counts when there are 3 variables. So i don't know what variable is the count.

summarise(count = n()) %>%

in short I want to get a frequency and percent of how many times a goal was mentioned or not, every time Q6 and Q9 appeared together. I get this output but I am not sure if this is the correct frequency(count).

   Q6    Q9    Q11   count per   
   <chr> <chr> <chr> <int> <chr> 
 1 0     104   2         1 100%  
 2 0     105   2         1 100%  
 3 0     22    2         1 100%  
 4 0     25    2         1 100%  
 5 0     29    2         1 100%  
 6 0     30    2         1 100%  
 7 0     31    1         1 100%  
 8 0     42    1         1 100%  
 9 0     44    2         2 66.67%
10 0     44    NA        1 33.33%
11 0     5     1         1 100%  
12 0     51    NA        1 100%  
13 0     52    1         1 100%  
14 0     63    2         1 100%  
15 0     7     1         1 100%  
16 0     76    1         1 100%  
17 0     77    2         1 100%  
18 0     83    2         1 100%  
19 0     85    2         1 100%  
20 0     NA    NA        9 100%  
21 1     14    1         1 100%  
22 1     39    1         1 50%   
23 1     39    2         1 50%   
24 101   0     1         1 100%  
25 101   42    1         1 100%  

this is a table of more than 500 rows, so I'd need to order them in decreasing order. So for example in the table below, line 2 would have to mean "When Q9(=44) was mentioned for Q6(=23), 8 times there was no goal mentioned (Q11=2) ."

lines 3,4 and 5 would be interpreted: "for Q6(=52), when topic 30 was mentioned in 8 instances there was also a goal mentioned, but when topic 89 was mentioned there was no goal in 7 instances, and in 6 instances when topic 29 was mentioned."

the percent throws me off, i am not sure how to interpret it, but i need to have it.

 Q6    Q9    Q11   count per   
   <chr> <chr> <chr> <int> <chr> 
 1 0     NA    NA        9 100%  
 2 23    44    2         8 100%  
 3 52    30    1         8 61.54%
 4 52    89    2         7 100%  
 5 52    29    2         6 66.67%
 6 66    63    1         6 54.55%
 7 97    30    1         6 60%   
 8 52    30    2         5 38.46%
 9 60    42    2         5 55.56%
10 66    63    2         5 45.45%
11 19    51    2         4 80%   
12 19    7     1         4 66.67%
13 24    49    2         4 57.14%
14 52    99    2         4 100%  
15 53    41    2         4 100%  
16 60    105   2         4 80%   
17 60    42    1         4 44.44%
18 97    30    2         4 40%   
19 97    60    2         4 100%  
20 19    16    2         3 100%  
21 24    49    1         3 42.86%
22 272   7     1         3 100%  
23 5     46    2         3 100%  
24 52    29    1         3 33.33%
25 52    31    1         3 100%  

is this correct? or does my count mean something else?

would really appreciate help with the interpretation, or a better code for what i am looking for?

Thank you!


Solution

  • n() returns you the number of cases of that particular combination in group_by. As you showed two different outputs, I'm not sure exactly how you got them, and so, not sure how to interpret your %s.

    Without a reproducible example, it's hard to help you fully. But if I got it right, you're on the right track. I'd just be careful with counting inside different group settings.

    There is definitely a more cleaver way of doing it, but I'd break it down in two steps, as in the code below, to not mess with different count numbers given different grouping variables

    library(dplyr)
    
    ## Crete some fake data
    set.seed(101)
    
    df <- 
      data.frame("Q6" = sample(8:10, size = 50, replace = TRUE),
                 "Q9" = round(rnorm(n = 50, mean = 32, sd = 2), digits = 0),
                 "Q11" = sample(1:2, size = 50, replace = TRUE))
    
    ## Then summarise the number of occurrences
    ## based on combinations of Q6 and Q9
    ## i.e. how many times that combination of Q6 and Q9 happened 
    
    out1 <- 
      df %>%
      group_by(Q6, Q9) %>%
      summarise(n_q6_q9 = n())
    
    ## Then count the number of Y/N (your Q11) by combinations of Q6 and Q9
    ## i.e. how many Y or N for each Q6~Q9 combination
    
    out2 <- 
      df %>%
      group_by(Q6, Q9, Q11) %>%
      summarise(n_q11 = n())
    
    ## Merge them and calculate the percentage
    
    out_final <- 
      left_join(out2, out1, by = c("Q6", "Q9")) %>%  ## Note order of out2 and out1
      mutate(per = paste0(round(n_q11/n_q6_q9 * 100, digits = 2), "%")) 
      # %>% ## Not sure if you need to arrange it?
      # group_by(Q6, Q9) %>%
      # arrange(per)