Tags: r, dataframe, for-loop, unique

Remove duplicate observations between columns of a specific row


This is a short example of the dataframe that I am trying to clean:

L3 <- LETTERS[1:5]    
fac<-c("fish", "meat", "chicken", "veg", "shrimp")

set.seed(1)
(d <- data.frame(code = sample(c(11:15)), 
      upc = sample(c(1:5)), desc = sample(fac), 
      desc1 = fac, desc2 = sample(fac), 
      desc3 = fac, desc4 = sample(fac) ))


  code upc    desc   desc1   desc2   desc3   desc4
1   12   5    meat    fish chicken    fish  shrimp
2   15   4    fish    meat  shrimp    meat    fish
3   14   2 chicken chicken     veg chicken    meat
4   13   3     veg     veg    fish     veg     veg
5   11   1  shrimp  shrimp    meat  shrimp chicken

I am trying to write a general function (using a for loop and unique()) that checks columns 3 to 7 of each row independently and keeps only the first occurrence of each value, blanking out any repeats in the later columns (i.e., if a row contains fish in all desc columns, the new row should contain fish in only one column). More specifically, the desired outcome is:

  code upc    desc desc1   desc2 desc3   desc4
1   12   5    meat  fish chicken        shrimp
2   15   4    fish  meat  shrimp              
3   14   2 chicken           veg          meat
4   13   3     veg          fish              
5   11   1  shrimp          meat       chicken


Solution

  • We can use duplicated() to replace the elements that are repeated within each row of the 'desc' columns with a blank ""

    nm1 <- grep('desc', names(d))
    d[nm1] <- t(apply(d[nm1], 1, function(x) {replace(x, duplicated(x), "")}))
    d
    #  code upc    desc desc1   desc2 desc3   desc4
    #1   12   5    meat  fish chicken        shrimp
    #2   15   4    fish  meat  shrimp              
    #3   14   2 chicken           veg          meat
    #4   13   3     veg          fish              
    #5   11   1  shrimp          meat       chicken
    

    Or use a for loop (this assumes the columns are of class character, or are factors that already have the blank "" as one of their levels before the assignment)

    for(i in seq_len(nrow(d))) d[i, nm1] <- replace(d[i, nm1], 
                                         duplicated(unlist(d[i, nm1])), '')
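
Since the question asks for a general function, the row-wise duplicated() idea above could be wrapped up roughly as follows. This is only a sketch under stated assumptions: the name blank_row_duplicates and its pattern and blank arguments are hypothetical (they are not part of the original answer), and the example data are typed in by hand from the table printed in the question, so the result does not depend on how sample() behaves in a given R version.

```r
# Hypothetical helper (not from the original answer): within each row, keep the
# first occurrence of a value and blank out later repeats, but only in the
# columns whose names match `pattern`.
blank_row_duplicates <- function(df, pattern = "^desc", blank = "") {
  cols <- grep(pattern, names(df))
  if (length(cols) == 0L) return(df)   # nothing to clean

  # A character matrix sidesteps the factor-level issue: any string can be
  # assigned, whether or not it was an existing level.
  m <- as.matrix(df[cols])

  for (i in seq_len(nrow(m))) {
    # duplicated() is TRUE for the 2nd, 3rd, ... occurrence of a value
    m[i, duplicated(m[i, ])] <- blank
  }

  # Write the cleaned columns back as plain character vectors
  df[cols] <- as.data.frame(m, stringsAsFactors = FALSE)
  df
}

# Example data copied from the printed data frame in the question
d <- data.frame(
  code  = c(12, 15, 14, 13, 11),
  upc   = c(5, 4, 2, 3, 1),
  desc  = c("meat", "fish", "chicken", "veg", "shrimp"),
  desc1 = c("fish", "meat", "chicken", "veg", "shrimp"),
  desc2 = c("chicken", "shrimp", "veg", "fish", "meat"),
  desc3 = c("fish", "meat", "chicken", "veg", "shrimp"),
  desc4 = c("shrimp", "fish", "meat", "veg", "chicken"),
  stringsAsFactors = FALSE
)

cleaned <- blank_row_duplicates(d)
print(cleaned)
```

With these inputs the printed result should match the desired outcome shown in the question. Looping over a matrix, rather than calling t(apply(...)), is a design choice made here so that the helper also behaves sensibly when only one column matches the pattern; after the call the matched columns are character vectors, not factors.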