Search code examples
rexcelduplicatescounting

Is there a COUNTIF function in Excel which does the entire dataset? (Or something similar in R)


I have a dataset in Excel with 4 columns and 8 rows, the first column being types and there are duplicate entries. Columns 2 to 4 are people and the cells are counts of events (sample set). I am not interested in adding up counts for each type per person, what I am after is how many types are found for each person. I can use the =SUMIF(range, criteria, {sum_range]) formula for the person A and in a second step remove the 'zero' counts and get the number of types, going one person at a time. My dataset has thousands of types and hundreds of names so that's not an option.

In essence, it would be about removing duplicates from the column type and indicating in the other columns with a yes/no or true/false if that person has that type.

Is there a formula which does all persons at once?

type person A person B person C
a 0 45 23462
b 456 3452 0
c 0 4 4536
d 56 0 1
d 9 1235 8
b 0 0 0
c 78 0 0

Solution

  • I have an R answer for you. (In the future, please provide your data in an easy-to-copy format!)

    df <- data.frame(type = c("a", "b", "c", "d", "d", "b", "c"), 
                     person_a = c(0, 456, 0, 56, 9, 0, 78), 
                     person_b = c(45, 3452, 4, 0, 1235, 0, 0),
                     person_c = c(23462, 0, 4536, 1, 8, 0, 0))
    
    df |> 
      tidyr::pivot_longer(cols = -c(type), names_to = "person", values_to = "count") |> 
      dplyr::group_by(person) |> 
      dplyr::filter(count != 0) |> 
      dplyr::summarize(num_types = length(unique(type))) |>
      dplyr::ungroup()
    

    This results in a nice little table:

    person num_types
    person_a 3
    person_b 4
    person_c 3

    This answer will scale for any arbitrary number of types and people. If your data frame contains other non-person columns that you don't want to aggregate, you could even modify the pivoting as follows: tidyr::pivot_longer(cols = tidyselect::starts_with("person"), names_to = "person", values_to = "count"). If you want to remove the "person" prefix, add the following parameter to the pivot_longer call: names_prefix = "person_".

    EDIT

    Is this what you wanted?

    df |> 
      dplyr::group_by(type) |> 
      dplyr::summarize(dplyr::across(dplyr::starts_with("person"), 
                       function(person) as.numeric(max(person) > 0)))
    

    The table would look like this:

    type person_a person_b person_c
    a 0 1 1
    b 1 1 0
    c 1 1 1
    d 1 1 1

    Basically, we summarize the columns of df that start with "person" to be 1 if at least one of the rows for that type has a positive count and 0 otherwise.