Search code examples
rmultiple-columnsunique-values

R: How to return unique values between columns with multiple values per cell?


I have a dataframe with multiple values per cell, and I want to find and return the values that are only in one column.

ID <- c("1","1","1","2","2","2","3","3","3")
locus <- c("A","B","C","A","B","C","A","B","C")
Acceptable <- c("K44 Z33 G49","K72 QR123","B92 N12 PQ16 G99","V3","L89 I203 UPA66 QF29"," ","K44 Z33 K72","B92 PQ14","J22 M43 VC78")
Unacceptable <- c("K44 Z33 G48","K72 QR123 B22","B92 N12 PQ16 G99","V3 N9 Q7","L89 I203 UPA66 QF29","B8","K44 Z33"," ","J22 M43 VC78")

df <- data.frame(ID,locus,Acceptable,Unacceptable)

dataframe

I want to make another column, Unique_values, that returns all the unique values that are only present in Unacceptable, and that are not in Acceptable. So the output should be this.

I already have a poorly optimized method to find the duplicates between the two columns:

df$Duplicate_values <- do.call(paste, c(df[,c("Acceptable","Unacceptable")], sep=" "))
df$Duplicate_values = sapply(strsplit(df$Duplicate_values, ' '), function(i)paste(i[duplicated(i)]))

#this is for cleaning up the text field so that it looks like the other columns
df$Duplicate_values = gsub("[^0-9A-Za-z///' ]"," ",df$Duplicate_values)
df$Duplicate_values = gsub("character 0",NA,df$Duplicate_values)
df$Duplicate_values = gsub("^c ","",df$Duplicate_values)
df$Duplicate_values = gsub("    "," ",df$Duplicate_values)
df$Duplicate_values = trimws(df$Duplicate_values)

(if anyone knows a faster method to return these duplicates, please let me now!)

I cannot use this method to find the unique values however, because it would then also return the unique values of the Acceptable column, which I do not want.

Any suggestions?


Solution

  • A similar approach using setdiff:

    lA <- strsplit(df$Acceptable, " ")
    lU <- strsplit(df$Unacceptable, " ")
    df$Unique_values <- lapply(1:nrow(df), function(i) paste0(setdiff(lU[[i]], lA[[i]]), collapse = " "))
    df
    #>   ID locus          Acceptable        Unacceptable Unique_values
    #> 1  1     A         K44 Z33 G49         K44 Z33 G48           G48
    #> 2  1     B           K72 QR123       K72 QR123 B22           B22
    #> 3  1     C    B92 N12 PQ16 G99    B92 N12 PQ16 G99              
    #> 4  2     A                  V3            V3 N9 Q7         N9 Q7
    #> 5  2     B L89 I203 UPA66 QF29 L89 I203 UPA66 QF29              
    #> 6  2     C                                      B8            B8
    #> 7  3     A         K44 Z33 K72             K44 Z33              
    #> 8  3     B            B92 PQ14                                  
    #> 9  3     C        J22 M43 VC78        J22 M43 VC78