I have a dataset in Excel with 4 columns and 8 rows, the first column being types and there are duplicate entries. Columns 2 to 4 are people and the cells are counts of events (sample set). I am not interested in adding up counts for each type per person, what I am after is how many types are found for each person. I can use the =SUMIF(range, criteria, {sum_range]) formula for the person A and in a second step remove the 'zero' counts and get the number of types, going one person at a time. My dataset has thousands of types and hundreds of names so that's not an option.
In essence, it would be about removing duplicates from the column type and indicating in the other columns with a yes/no or true/false if that person has that type.
Is there a formula which does all persons at once?
type | person A | person B | person C |
---|---|---|---|
a | 0 | 45 | 23462 |
b | 456 | 3452 | 0 |
c | 0 | 4 | 4536 |
d | 56 | 0 | 1 |
d | 9 | 1235 | 8 |
b | 0 | 0 | 0 |
c | 78 | 0 | 0 |
I have an R
answer for you. (In the future, please provide your data in an easy-to-copy format!)
df <- data.frame(type = c("a", "b", "c", "d", "d", "b", "c"),
person_a = c(0, 456, 0, 56, 9, 0, 78),
person_b = c(45, 3452, 4, 0, 1235, 0, 0),
person_c = c(23462, 0, 4536, 1, 8, 0, 0))
df |>
tidyr::pivot_longer(cols = -c(type), names_to = "person", values_to = "count") |>
dplyr::group_by(person) |>
dplyr::filter(count != 0) |>
dplyr::summarize(num_types = length(unique(type))) |>
dplyr::ungroup()
This results in a nice little table:
person | num_types |
---|---|
person_a | 3 |
person_b | 4 |
person_c | 3 |
This answer will scale for any arbitrary number of types and people. If your data frame contains other non-person columns that you don't want to aggregate, you could even modify the pivoting as follows: tidyr::pivot_longer(cols = tidyselect::starts_with("person"), names_to = "person", values_to = "count")
. If you want to remove the "person" prefix, add the following parameter to the pivot_longer
call: names_prefix = "person_"
.
EDIT
Is this what you wanted?
df |>
dplyr::group_by(type) |>
dplyr::summarize(dplyr::across(dplyr::starts_with("person"),
function(person) as.numeric(max(person) > 0)))
The table would look like this:
type | person_a | person_b | person_c |
---|---|---|---|
a | 0 | 1 | 1 |
b | 1 | 1 | 0 |
c | 1 | 1 | 1 |
d | 1 | 1 | 1 |
Basically, we summarize the columns of df
that start with "person" to be 1 if at least one of the rows for that type has a positive count and 0 otherwise.