r, dataframe, data-cleaning

Make R ignore the order in which values appear in a column (created by pasting multiple columns)


Given a variable x that can take the values A, B, C, or D,

and three columns holding values of x:

df1 <- rbind(c("A","B","C"), c("A","D","C"), c("B","A","C"), c("A","C","B"),
             c("B","C","A"), c("D","A","B"), c("A","B","D"), c("A","D","C"),
             c("A",NA,NA), c("D","A",NA), c("A","D",NA))

How do I create a column indicating the combination of values in the three preceding columns, such that permutations (ABC, ACB, BAC) are treated as the same combination ABC, and (AD, DA) as the same combination AD?

Pasting the three columns with apply(df1, 1, function(x) paste(x[!is.na(x)], collapse = ", ")) -> df1$x4 and then using df1 %>% group_by(x4) %>% summarize(c = n()) counts AD and DA as different combinations instead of the same one.

Edited title

My desired result would be a <- cbind(c("ABC", 4), c("ACD", 2), c("ABD", 2), c("A", 1), c("AD", 2))

Someone already solved my question. Thanks


Solution

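  • You can apply paste to each row after sorting it. Sorting puts every permutation of the same values into the same order, and sort drops NA values by default, so rows with missing entries need no special handling.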

    df1 <- 
      cbind(df1, apply(df1, 1, function(x) paste(sort(x), collapse = "")))
    
    df1
    #      [,1] [,2] [,3] [,4] 
    # [1,] "A"  "B"  "C"  "ABC"
    # [2,] "A"  "D"  "C"  "ACD"
    # [3,] "B"  "A"  "C"  "ABC"
    # [4,] "A"  "C"  "B"  "ABC"
    # [5,] "B"  "C"  "A"  "ABC"
    # [6,] "D"  "A"  "B"  "ABD"
    # [7,] "A"  "B"  "D"  "ABD"
    # [8,] "A"  "D"  "C"  "ACD"
    # [9,] "A"  NA   NA   "A"  
    #[10,] "D"  "A"  NA   "AD" 
    #[11,] "A"  "D"  NA   "AD"
    

    You can now simply tabulate the new column with table; there is no need to load an external package or build a more complex pipe.

    table(df1[, 4])
    #A ABC ABD ACD  AD 
    #1   4   2   2   2
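
If the counts are needed as a data frame rather than a printed table, the same idea can be sketched in base R as below. This is only an illustration under a few assumptions: combo_key is a hypothetical helper name, the column name combo is made up for this example, and the data is the matrix from the question.

```r
# Same data as in the question: one row per observation, NA for missing values.
df1 <- rbind(c("A","B","C"), c("A","D","C"), c("B","A","C"), c("A","C","B"),
             c("B","C","A"), c("D","A","B"), c("A","B","D"), c("A","D","C"),
             c("A",NA,NA), c("D","A",NA), c("A","D",NA))

# combo_key is a hypothetical helper, not part of any package.
# sort() drops NA by default (na.last = NA), so no explicit NA filter is needed,
# and sorting gives every permutation of the same values the same key.
combo_key <- function(x) paste(sort(x), collapse = "")

combo_key(c("D", "A", NA))  # "AD"
combo_key(c("A", "D", NA))  # "AD"

# One key per row of the matrix.
keys <- apply(df1, 1, combo_key)

# as.data.frame() on a table gives one row per combination plus a Freq column.
counts <- as.data.frame(table(combo = keys), stringsAsFactors = FALSE)
counts
```

Here counts has one row per distinct combination, with the count in the Freq column, so it can be filtered or merged like any other data frame.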