Tags: r, tidyverse

In R, how to find the proportion of cases which have a value present in another column?


This seemed really simple to me at first, but is unexpectedly giving me trouble. Let's say my dataset looked like this:

library(tidyverse)

mock <- tribble(
  ~case_id, ~characteristic,
  1, "A",  1, "A",  1, "B",  2, "A",  2, "A",
  3, "B",  3, "B",  4, "C",  5, "A",  5, NA
)

What I'd like to do is calculate the proportion of cases where any of the values of characteristic is equal to "A". In this example, I'd want to calculate that 3/5 of cases (cases 1, 2, and 5) fit my criterion.

I've managed to do this through a combination of group_by(), case_when(), any(), distinct(), and count(), but I know there has to be some much, much simpler method that I'm just not seeing.

Thanks for your help!


Solution

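  • Here are a couple of ways. The first counts the distinct case_id values that appear in a row with "A" and divides by the total number of distinct cases. The second flags each case as having an "A" or not, then takes the mean of the flags (the .by argument requires dplyr 1.1.0 or later):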

    mock |>
      summarize(
        # rows where characteristic is NA yield an NA case_id; na.rm = TRUE drops it
        result = n_distinct(case_id[characteristic == "A"], na.rm = TRUE) /
          n_distinct(case_id)
      )
    # # A tibble: 1 × 1
    #   result
    #    <dbl>
    # 1    0.6
    
    mock |> 
      summarize(has_a = "A" %in% characteristic, .by = case_id) |> 
      summarize(result = mean(has_a))
    # # A tibble: 1 × 1
    #   result
    #    <dbl>
    # 1    0.6
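
The two approaches above treat the NA in case 5 differently, which is easy to miss. The sketch below is not part of the original answer. It shows the difference between `==` and `%in%` on missing values, then two variants that should give the same 0.6: one using any(), which the question mentions, and one in base R. The names result_any and result_base are made up for illustration, and the .by argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Same mock data as in the question
mock <- tibble::tribble(
  ~case_id, ~characteristic,
  1, "A",  1, "A",  1, "B",  2, "A",  2, "A",
  3, "B",  3, "B",  4, "C",  5, "A",  5, NA
)

# `==` propagates NA, while `%in%` always returns TRUE or FALSE
c("A", NA) == "A"    # TRUE NA
"A" %in% c("A", NA)  # TRUE

# Variant with any(): na.rm = TRUE stops a case containing only
# non-"A" values and NA from being flagged as NA instead of FALSE
result_any <- mock |>
  summarize(has_a = any(characteristic == "A", na.rm = TRUE), .by = case_id) |>
  summarize(result = mean(has_a)) |>
  pull(result)

# Base R variant: one logical flag per case via tapply(), then the mean
result_base <- mean(
  tapply(mock$characteristic, mock$case_id, function(x) "A" %in% x)
)
```

Both result_any and result_base hold a plain number rather than a one-row tibble, which can be handier if the proportion feeds into further calculations.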