I want to compare groups of variables to an external list (vector "Valid_codes").
Consider the following example:
ID<-c("Carl", "Carl","Carl","Peter","Peter","Peter")
Question<-c("need","need","need","dyadic","dyadic","dyadic")
Image<-c("image1","image1","image1","image2","image2","image2")
V1<-c("A1","A2","C0","A3","A3","A1")
df<-data.frame(ID,Question,Image,V1)
df
Valid_codes<-c("A1","A2","A3","A4")
I want output like the table below: V1 is grouped by ID and Question, each group is compared to the vector Valid_codes, and the difference is written to a new column (MissingCodes, i.e. the valid codes not used in the group). Image should simply be carried over from the original, as it is constant within each group.
Grouped(ID,Question) | Image | MissingCodes (setdiff()) |
---|---|---|
Carl_need | image1 | A3, A4 |
Peter_dyadic | image2 | A2, A4 |
I am new to data wrangling. I have used setdiff() on the full dataset, but am having trouble doing the same on grouped data. The actual data set contains approx. 40,000 rows.
df%>%
group_by(ID,Question, across(Image))%>%
mutate(Missing_Codes=setdiff(Valid_codes,?))
Thanks a lot for any help!
This should do it (the .by argument needs dplyr 1.1.0 or later):
df |>
summarize(present_codes = list(V1), .by = c(ID, Question, Image)) |>
mutate(missing_codes = lapply(present_codes, setdiff, x = Valid_codes))
# ID Question Image present_codes missing_codes
# 1 Carl need image1 A1, A2, C0 A3, A4
# 2 Peter dyadic image2 A3, A3, A1 A2, A4
Note that the present_codes and missing_codes columns are list class columns, not character vectors. (Each row of present_codes and missing_codes is a character vector, rather than the column being a character vector.) This should help for flexibility later, but if you want to convert them you can add, e.g., ... |> mutate(missing_codes = sapply(missing_codes, toString)).
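As a rough, self-contained sketch of that conversion, assuming dplyr 1.1.0 or later is installed: the name result is made up for illustration, and vapply() is swapped in for sapply() only so that the return type is guaranteed to be character.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0, needed for the .by argument

# Example data from the question
df <- data.frame(
  ID       = c("Carl", "Carl", "Carl", "Peter", "Peter", "Peter"),
  Question = c("need", "need", "need", "dyadic", "dyadic", "dyadic"),
  Image    = c("image1", "image1", "image1", "image2", "image2", "image2"),
  V1       = c("A1", "A2", "C0", "A3", "A3", "A1")
)
Valid_codes <- c("A1", "A2", "A3", "A4")

# 'result' is a hypothetical name, not from the original answer
result <- df |>
  summarize(present_codes = list(V1), .by = c(ID, Question, Image)) |>
  mutate(
    # list column: one character vector of unused valid codes per group
    missing_codes = lapply(present_codes, setdiff, x = Valid_codes),
    # collapse each vector into a single comma-separated string
    missing_codes = vapply(missing_codes, toString, character(1))
  )

result$missing_codes
```

After this step missing_codes is an ordinary character column, which is easier to print or export, at the cost of having to split the strings again if you need the individual codes later.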
A little tip - when you say "Image no. should just be transferred from original as it is uniform across the groups", you should just include it in the grouping. When you're summarizing data down to 1 row per group, there's no real way to "bring something along" - it's either (a) part of the grouping, (b) it needs a summary function to collapse it to one value, or (c) it will be dropped.
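For comparison, option (b) might look like the sketch below, again assuming dplyr 1.1.0 or later. The name result_b and the choice of first() as the summary function are illustrative assumptions; first() is only safe here because Image really is constant within each group.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0, needed for the .by argument

# Example data from the question
df <- data.frame(
  ID       = c("Carl", "Carl", "Carl", "Peter", "Peter", "Peter"),
  Question = c("need", "need", "need", "dyadic", "dyadic", "dyadic"),
  Image    = c("image1", "image1", "image1", "image2", "image2", "image2"),
  V1       = c("A1", "A2", "C0", "A3", "A3", "A1")
)
Valid_codes <- c("A1", "A2", "A3", "A4")

# 'result_b' is a hypothetical name, not from the original answer
result_b <- df |>
  summarize(
    # Image is not a grouping variable here, so it needs a summary function
    Image = first(Image),
    # valid codes that never occur in this group's V1, as one string
    missing_codes = toString(setdiff(Valid_codes, V1)),
    .by = c(ID, Question)
  )

result_b
```

Putting Image in the grouping, as in the answer above, is the more defensive choice: if Image ever varied within an ID and Question pair, grouping would produce extra rows that reveal it, whereas first() would silently keep one value.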