Search code examples
rtidyversestartswithdata-wranglingredcap

Using the Tidyverse to the count the frequency of a specific string in REDCap data across multiple columns


I frequently receive data from REDCap surveys in which respondents are permitted to "Check" >1 response to a survey question. Each potential response is housed in its own column. I'd like to summarize the frequency with which each response option (column) has been checked. For example:

library(tidyverse)
set.seed(1234)
responses<-c("Checked", "Unchecked")
numobs<-10

my_example<-data.frame(id=1:10, 
                       Response_Option_A=sample(responses, numobs, replace=TRUE), 
                       Response_Option_B=sample(responses, numobs, replace=TRUE), 
                       Response_Option_C=sample(responses, numobs, replace=TRUE),
                       Response_Option_D=sample(responses, numobs, replace=TRUE),
                       stringsAsFactors = FALSE)

my_example
#>    id Response_Option_A Response_Option_B Response_Option_C Response_Option_D
#> 1   1         Unchecked         Unchecked         Unchecked           Checked
#> 2   2         Unchecked         Unchecked         Unchecked         Unchecked
#> 3   3         Unchecked         Unchecked         Unchecked           Checked
#> 4   4         Unchecked           Checked         Unchecked           Checked
#> 5   5           Checked         Unchecked         Unchecked           Checked
#> 6   6         Unchecked         Unchecked         Unchecked         Unchecked
#> 7   7           Checked         Unchecked           Checked           Checked
#> 8   8           Checked           Checked         Unchecked         Unchecked
#> 9   9           Checked         Unchecked         Unchecked         Unchecked
#> 10 10         Unchecked         Unchecked         Unchecked           Checked

My initial inclination was to try this, but it returns the total number of responses checked, rather than the number in each column.

my_example %>%
  select(starts_with("Response_Option_")) %>%
    summarise(checked=sum(.=="Checked"))
#>   checked
#> 1      13

Created on 2020-08-10 by the reprex package (v0.3.0)

Help summarizing these responses efficiently is appreciated.


Solution

  • This is a tidyverse approach to show the response totals per column, not by row. I assume, by how you worded your question, that this is what you are looking for. Also included is the starts_with() function which was included in your question tags.

    We can use pivot_longer() to pivot the response features from wide to long then use group_by to define the variables taking the existing table and converting it to a grouped table where the summarise() operation is used to create a new data frame with rows provided for each combination of grouping variables.

    library(tidyverse)
    set.seed(1234)
    responses<-c("Checked", "Unchecked")
    numobs<-10
    
    my_example<-data.frame(id=1:10, 
                           Response_Option_A=sample(responses, numobs, replace=TRUE), 
                           Response_Option_B=sample(responses, numobs, replace=TRUE), 
                           Response_Option_C=sample(responses, numobs, replace=TRUE),
                           Response_Option_D=sample(responses, numobs, replace=TRUE),
                           stringsAsFactors = FALSE)
    
    my_example %>% 
      pivot_longer(starts_with("Response_"), names_to = "Responses", 
                   values_to = "value") %>% 
      group_by(Responses, value) %>%
      summarise(total_responses = n())
    
    
    #> # A tibble: 8 x 3
    #> # Groups:   Responses [4]
    #>   Responses         value     total_responses
    #>   <chr>             <chr>               <int>
    #> 1 Response_Option_A Checked                 4
    #> 2 Response_Option_A Unchecked               6
    #> 3 Response_Option_B Checked                 2
    #> 4 Response_Option_B Unchecked               8
    #> 5 Response_Option_C Checked                 1
    #> 6 Response_Option_C Unchecked               9
    #> 7 Response_Option_D Checked                 6
    #> 8 Response_Option_D Unchecked               4
    

    Created on 2020-08-10 by the reprex package (v0.3.0)

    If you want just the Checked responses, you can add the following line of code after the summarise() operation:

    filter(value == "Checked")
    
    #> # A tibble: 4 x 3
    #> # Groups:   Responses [4]
    #>   Responses         value   total_responses
    #>   <chr>             <chr>             <int>
    #> 1 Response_Option_A Checked               4
    #> 2 Response_Option_B Checked               2
    #> 3 Response_Option_C Checked               1
    #> 4 Response_Option_D Checked               6