I am trying to add a column to a dataframe in R that provides counts for the number of times a unique value in one column has a value of 1 in a binary column. The data is from a study that involved participants listening to sentences and marking syllables that sounded high-pitched. Here is a sample of the data, with identifiers for ten syllables in syll and three participants in id. The columns id and syll need to be categorical, and high needs to be binary/numeric.
id syll high
1 1 0
1 2 0
1 3 1
1 4 0
1 5 0
1 6 0
1 7 0
1 8 0
1 9 0
1 10 0
2 1 0
2 2 1
2 3 1
2 4 0
2 5 1
2 6 1
2 7 0
2 8 0
2 9 0
2 10 0
3 1 0
3 2 1
3 3 0
3 4 0
3 5 0
3 6 0
3 7 0
3 8 0
3 9 0
3 10 0
What I would like to do is add a column high_count that counts the number of times each syllable was perceived as high-pitched. For syllable 1 (in syll), for instance, none of the three participants (in id) marked it as high-pitched (in high), so the value in the new column would be 0. For syllable 2, two of the participants (#2 and #3) marked it as high-pitched, so the value in the new column would be 2. These high_count values need to repeat on every row where the syllable appears (i.e., once per participant). Here is how it should come out looking:
id syll high high_count
1 1 0 0
1 2 0 2
1 3 1 2
1 4 0 0
1 5 0 1
1 6 0 1
1 7 0 0
1 8 0 0
1 9 0 0
1 10 0 0
2 1 0 0
2 2 1 2
2 3 1 2
2 4 0 0
2 5 1 1
2 6 1 1
2 7 0 0
2 8 0 0
2 9 0 0
2 10 0 0
3 1 0 0
3 2 1 2
3 3 0 2
3 4 0 0
3 5 0 1
3 6 0 1
3 7 0 0
3 8 0 0
3 9 0 0
3 10 0 0
I have looked at other posts (here and here) but they don't seem to quite address my situation.
Using this code:
high_counts <- df %>% count(syll, high=factor(high))
I was able to get R to summarize the number of 0 and 1 counts for each syllable:
syll high n
1 1 0 30
2 1 1 8
3 2 0 15
4 2 1 23
5 3 0 29
6 3 1 9
7 4 0 36
8 4 1 2
9 5 0 33
10 5 1 5
... etc.
But I haven't been able to get this into a new column in my dataframe. I would also only want to keep the "1" values.
Let me know if I can clarify anything.
Regarding "The columns id and syll need to be categorical, and high needs to be binary/numeric":
library(tidyverse)
# Note: if the set of possible levels is larger than the set that actually occurs
# in the data (i.e. some options were never chosen), use factor() instead of
# as.factor() and set the levels manually.
df <- df |>
  mutate(across(-high, as.factor),   # id and syll become factors
         high = as.logical(high))    # 0/1 becomes FALSE/TRUE; sum() still counts the TRUEs
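As a small illustration of the note about setting levels manually, here is a sketch using a made-up vector (the name observed and its values are hypothetical, not taken from your data). It shows how as.factor() only knows about values that occur, while factor() with an explicit levels argument also keeps the ones that were never chosen:

```r
# Hypothetical responses in which only syllables 2, 3 and 5 occur
observed <- c(2, 3, 3, 5)

inferred <- as.factor(observed)               # levels inferred from the data
declared <- factor(observed, levels = 1:10)   # levels stated explicitly

nlevels(inferred)   # number of distinct values present
nlevels(declared)   # number of levels declared, whether observed or not
table(declared)     # unobserved syllables appear with a count of 0
```

Declaring the levels matters mostly for tabulations and plots, where you may want unchosen syllables to show up as zero instead of disappearing.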
To get high_count:
mutate(df, high_count = sum(high), .by = syll)
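For a version you can run end to end, here is a minimal sketch that rebuilds the sample data from the question (the construction with rep() is my own, and I keep high numeric here) and computes the same column. The .by argument needs dplyr 1.1.0 or later, so this sketch uses the group_by() form, which should give the same result on older versions too:

```r
library(dplyr)

# Sample data from the question: 3 participants x 10 syllables
df <- data.frame(
  id   = factor(rep(1:3, each = 10)),
  syll = factor(rep(1:10, times = 3)),
  high = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0,   # participant 1
           0, 1, 1, 0, 1, 1, 0, 0, 0, 0,   # participant 2
           0, 1, 0, 0, 0, 0, 0, 0, 0, 0)   # participant 3
)

# Grouped mutate: every row of a syllable receives that syllable's total,
# so the row count of df is unchanged
result <- df |>
  group_by(syll) |>
  mutate(high_count = sum(high)) |>
  ungroup()

print(result, n = 30)
```

Because mutate() keeps all rows, the count is repeated once per participant, which is the layout you asked for.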
To get high_counts as a column:
mutate(df, n = n(), .by = c(syll, high))
To filter to just ones where high == 1 (i.e. TRUE):
filter(df, high)
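If what you are after is the summary table from your count() call but with only the "1" counts, the following sketch shows two ways to get it on the sample data (rebuilt here by me, with high kept numeric). The object names ones_per_syll and marked_only are just illustrative:

```r
library(dplyr)

# Sample data from the question: 3 participants x 10 syllables
df <- data.frame(
  id   = factor(rep(1:3, each = 10)),
  syll = factor(rep(1:10, times = 3)),
  high = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
           0, 1, 1, 0, 1, 1, 0, 0, 0, 0,
           0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
)

# wt = high makes count() sum the high column per syllable,
# so syllables nobody marked are kept with a count of 0
ones_per_syll <- count(df, syll, wt = high, name = "high_count")

# Filtering first counts only the rows with high == 1,
# so syllables nobody marked do not appear at all
marked_only <- df |>
  filter(high == 1) |>
  count(syll, name = "high_count")
```

Which one you want depends on whether the zero-count syllables should stay in the summary; the weighted count keeps them, the filtered count drops them.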
Alternatively, to add a column with the number of times each syllable was marked high by summarising first and joining the result back on:
left_join(df,
          summarise(df, high_count = sum(high), .by = syll),
          by = "syll")
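If you would rather not depend on dplyr for this step, base R can produce the same column. This is a swapped-in alternative, not part of the approach above: a minimal sketch using ave() on the sample data from the question (rebuilt here by me, with high kept numeric):

```r
# Sample data from the question: 3 participants x 10 syllables
df <- data.frame(
  id   = factor(rep(1:3, each = 10)),
  syll = factor(rep(1:10, times = 3)),
  high = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
           0, 1, 1, 0, 1, 1, 0, 0, 0, 0,
           0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
)

# ave() applies FUN within each level of syll and returns a vector
# aligned with the original rows, so it can be assigned directly
df$high_count <- ave(df$high, df$syll, FUN = sum)

head(df, 10)
```

ave() expects a numeric first argument, so if high has already been converted to logical you would pass as.numeric(df$high) instead.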