I'm a beginner in R and am trying to create a new column that will have values determined by values of rows in two columns.
My data frame looks something like this:
df <- data.frame(subjectid = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3), subj_day = c("1_1","1_1","1_1","1_2","1_2","2_1","2_1","2_1","2_2","2_2","3_1","3_1","3_1","3_2","3_2"), done_meditation = c(0,0,1,1,1,0,1,1,0,0,1,1,1,1,1))
> df
subjectid subj_day done_meditation
1 1 1_1 0
2 1 1_1 0
3 1 1_1 1
4 1 1_2 1
5 1 1_2 1
6 2 2_1 0
7 2 2_1 1
8 2 2_1 1
9 2 2_2 0
10 2 2_2 0
11 3 3_1 1
12 3 3_1 1
13 3 3_1 1
14 3 3_2 1
15 3 3_2 1
This data is from a multi-day study, and participants provided multiple responses each day. Each row reflects a response.
The subj_day column shows the participant number and the day of the study for that participant at the time of the response reflected in that row.
The values in 'done_meditation' refer to whether the participant had completed a meditation on that day at the time of their response. If a response was provided after completing a meditation, done_meditation = 1. So, if a participant completed a meditation before any responses on that day, all their done_meditation rows for that day would = 1.
However, if they provided some responses before completing the meditation, and some after, only some of the done_meditation rows would = 1.
I need to create a column that will provide the number of meditations a participant completed across the entire study.
So, the new data frame might look like this:
> df
subjectid subj_day done_meditation total_meditations
1 1 1_1 0 2
2 1 1_1 0 2
3 1 1_1 1 2
4 1 1_2 1 2
5 1 1_2 1 2
6 2 2_1 0 1
7 2 2_1 1 1
8 2 2_1 1 1
9 2 2_2 0 1
10 2 2_2 0 1
11 3 3_1 1 2
12 3 3_1 1 2
13 3 3_1 1 2
14 3 3_2 1 2
15 3 3_2 1 2
Because participant 2 only completed a meditation on 1 of the 2 study days, they receive a score of 1 in every row.
Because participants 1 and 3 completed a meditation on two days of the study, they receive a score of 2 in every row.
It might be useful to know that the actual study was 7 days long, so participants could have a total_meditations score of up to 7.
I have tried to write a few loops without success and have tried to use the group_by and mutate functions in dplyr but without success. I'm very new to R and have been struggling to figure this one out.
Thanks so much in advance for any help!
UPDATE: I think I've found a solution that works. The below seems to do the job. Still open to any new suggestions if someone has a more elegant solution :)
df2 <- df %>% group_by(subjectid, subj_day) %>%
summarise(meditationCount = sum(done_meditation)) %>%
mutate(meditationCount = ifelse(meditationCount >= 1, 1, 0)) %>%
group_by(subjectid) %>%
summarise(total_meditations = sum(meditationCount))
merged.df <- merge(df,df2, by = "subjectid")
Good code in constructing your own answer! A few small suggestions for condensed code:
max
to get a 1 or 0 dependend on day instead of sum
/ifelse
summarise
to drop the subj_day
group for you using .groups = "drop_last"
so no need for a second group_by
call.magrittr
%>%
pipe also works)df |>
group_by(subjectid, subj_day) |>
summarise(mediated = max(done_meditation), .groups = "drop_last") |>
summarise(total_meditations = sum(mediated)) |>
left_join(df, y = _, by = "subjectid")
# Or use `%>% left_join(df, y = ., ...)` with magrittr pipe
#> subjectid subj_day done_meditation total_meditations
#> 1 1 1_1 0 2
#> 2 1 1_1 0 2
#> 3 1 1_1 1 2
#> 4 1 1_2 1 2
#> 5 1 1_2 1 2
#> 6 2 2_1 0 1
#> 7 2 2_1 1 1
#> 8 2 2_1 1 1
#> 9 2 2_2 0 1
#> 10 2 2_2 0 1
#> 11 3 3_1 1 2
#> 12 3 3_1 1 2
#> 13 3 3_1 1 2
#> 14 3 3_2 1 2
#> 15 3 3_2 1 2
P.S. On SO you are allowed to answer your own question, so if you're satisfied with your own solution do post it as an answer and you can accept it to give yourself the well-deserved credit!