
setdiff() on grouped variables


I want to compare groups of values to an external vector (Valid_codes).

Consider the following example:

ID <- c("Carl", "Carl", "Carl", "Peter", "Peter", "Peter")
Question <- c("need", "need", "need", "dyadic", "dyadic", "dyadic")
Image <- c("image1", "image1", "image1", "image2", "image2", "image2")
V1 <- c("A1", "A2", "C0", "A3", "A3", "A1")
df <- data.frame(ID, Question, Image, V1)
df
Valid_codes <- c("A1", "A2", "A3", "A4")

I want to get an output like the one below, where V1 has been grouped by ID and Question, each group has been compared to the vector Valid_codes, and the difference is written to a new column (MissingCodes, that is, the valid codes not used in the group). Image should simply be carried over from the original data, as it is constant within each group.

Grouped(ID,Question)   Image    MissingCodes (setdiff())
Carl_need              image1   A3, A4
Peter_dyadic           image2   A2, A4

I am new to data wrangling. I have used setdiff() on the full dataset, but I am having trouble applying it to grouped data. The actual dataset contains approx. 40,000 rows.

df %>%
  group_by(ID, Question, across(Image)) %>%
  mutate(Missing_Codes = setdiff(Valid_codes, ?))

Thanks a lot for any help!


Solution

  • This should do it:

    library(dplyr)  # the .by argument needs dplyr >= 1.1.0

    df |>
      summarize(present_codes = list(V1), .by = c(ID, Question, Image)) |>
      mutate(missing_codes = lapply(present_codes, setdiff, x = Valid_codes))
    #      ID Question  Image present_codes missing_codes
    # 1  Carl     need image1    A1, A2, C0        A3, A4
    # 2 Peter   dyadic image2    A3, A3, A1        A2, A4
    

    Note that present_codes and missing_codes are list columns, not character vectors: each row holds its own character vector, rather than the column as a whole being one character vector. This keeps them flexible for later steps, but if you want plain strings you can convert them by adding, e.g., ... |> mutate(missing_codes = sapply(missing_codes, toString)).

    A little tip: when you say "Image no. should just be transferred from original as it is uniform across the groups", you should just include it in the grouping. When you're summarizing data down to one row per group, there's no real way to "bring something along". A column is either (a) part of the grouping, (b) collapsed to one value by a summary function, or (c) dropped.
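As a rough illustration of options (b) and (c), here is a minimal sketch that rebuilds the example data from the question. It is a variation on the answer's code rather than a copy of it: it assumes dplyr >= 1.1.0 (for `.by`) and R >= 4.1 (for `|>`), calls setdiff() directly inside summarize(), and uses first() as one possible summary function for Image. The names by_summary and dropped are made up for this example.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0, needed for the .by argument

# Example data from the question
df <- data.frame(
  ID       = c("Carl", "Carl", "Carl", "Peter", "Peter", "Peter"),
  Question = c("need", "need", "need", "dyadic", "dyadic", "dyadic"),
  Image    = c("image1", "image1", "image1", "image2", "image2", "image2"),
  V1       = c("A1", "A2", "C0", "A3", "A3", "A1")
)
Valid_codes <- c("A1", "A2", "A3", "A4")

# (b) Group by ID and Question only, and collapse Image with a summary
#     function. first() is safe here only because Image is constant
#     within each group.
by_summary <- df |>
  summarize(
    Image         = first(Image),
    missing_codes = list(setdiff(Valid_codes, V1)),  # list column, one vector per group
    .by = c(ID, Question)
  )

# (c) Leave Image out of both the grouping and the summaries: it is dropped.
dropped <- df |>
  summarize(
    missing_codes = list(setdiff(Valid_codes, V1)),
    .by = c(ID, Question)
  )

names(by_summary)  # ID, Question, Image, missing_codes
names(dropped)     # ID, Question, missing_codes
```

Putting Image in the grouping, as the answer does, is probably the safer default: if Image ever turned out not to be constant within a group, you would get an extra row that reveals it, whereas first() would quietly keep only one value.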