Search code examples
rdplyrcountsummarize

Count unique occurrences of factor levels and numeric values with dplyr, on data in a long format


I have data on repeated measurements of 8 patients, each with varying amount of repeated measurements on the same variables. The measured variables are sex, blood pressure (sys_bp), and how many CT scans a person underwent:

library(dplyr)
library(magrittr)

questiondata <- structure(list(id = c(2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 
4, 7, 7, 8, 8, 8, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 20, 
20, 20), time = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 
1L, 2L, 3L, 4L, 5L, 1L, 6L, 1L, 2L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 
2L, 3L, 4L, 5L, 1L, 2L, 4L), .Label = c("T0", "T1M0", "T1M6", 
"T1M12", "T2M0", "FU1"), class = "factor"), sys_bp = c(116, 125.8, 
NA, NA, NA, 113.2, NA, NA, NA, NA, 146, NA, NA, NA, NA, NA, NA, 
125, NA, NA, 164.5, NA, NA, NA, NA, 150.5, NA, NA, NA, NA, 158, 
NA), sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L), .Label = c("female", "male"), class = "factor"), 
    ct_amount = c(4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
    5L, 5L, 5L, 2L, 2L, 3L, 3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
    5L, 5L, 5L, 3L, 3L, 3L)), row.names = c(NA, -32L), class = c("tbl_df", 
"tbl", "data.frame"))

questiondata

      id time  sys_bp sex    ct_amount
   <dbl> <fct>  <dbl> <fct>      <int>
 1     2 T0      116  female         4
 2     2 T1M0    126. female         4
 3     2 T1M6     NA  female         4
 4     2 T1M12    NA  female         4
 5     3 T0       NA  female         5
 6     3 T1M0    113. female         5
 7     3 T1M6     NA  female         5
 8     3 T1M12    NA  female         5
 9     3 T2M0     NA  female         5
10     4 T0       NA  male           5
11     4 T1M0    146  male           5
12     4 T1M6     NA  male           5
13     4 T1M12    NA  male           5
14     4 T2M0     NA  male           5
15     7 T0       NA  female         2
16     7 FU1      NA  female         2
17     8 T0       NA  female         3
18     8 T1M0    125  female         3
19     8 T2M0     NA  female         3
20    13 T0       NA  female         5
21    13 T1M0    164. female         5
22    13 T1M6     NA  female         5
23    13 T1M12    NA  female         5
24    13 T2M0     NA  female         5
25    14 T0       NA  male           5
26    14 T1M0    150. male           5
27    14 T1M6     NA  male           5
28    14 T1M12    NA  male           5
29    14 T2M0     NA  male           5
30    20 T0       NA  female         3
31    20 T1M0    158  female         3
32    20 T1M12    NA  female         3

I am trying to count the number of persons that (1) is male/female (2) has 1/2/3/4/5 CT scans.

So the output would be that there are (1) 6 females and 2 males, and (2) 1 person with 2 CTs, 2 persons with 3 CTs, 1 person with 4 CTs and 4 persons with 5 CTs.

I've tried many combinations of group_by and summarise and count, but can't seem to get it right. Any help?


Solution

  • You can first keep only the unique rows for each id. Then use count to get the output.

    library(dplyr)
    
    unique_data <- questiondata %>% distinct(id, .keep_all = TRUE)
    
    unique_data %>% count(sex)
    # A tibble: 2 x 2
    #  sex        n
    #  <fct>  <int>
    #1 female     6
    #2 male       2
    
    unique_data %>% count(ct_amount)
    
    # A tibble: 4 x 2
    #  ct_amount     n
    #      <int> <int>
    #1         2     1
    #2         3     2
    #3         4     1
    #4         5     4