Tags: r, string-matching, delete-row

Delete rows with matching words in the same column, and matching values in multiple columns


I have a data frame (data3) with over 20,000 rows and a column named "Collector". Each entry in this column is a string of words, for example "Ruiz Galvis Marta". I need to compare each row with every other row in the data frame and delete the rows in which one or more words in data3$Collector match a word in the same column of another row, and the values in the "sample" and "number" columns match as well. That is:

INPUT:

Collector            Times   sample   number
Ruiz Galvis Marta    9       SP.1     one
Smith et al Marta    8       SP.2     two
Ruiz Andres Allan    4       SP.1     one


EXPECTED OUTPUT:

Collector            Times   sample   number
Smith et al Marta    8       SP.2     two

Thanks for any help!


Solution

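  • This is probably going to be slow as hell on 20,000 rows, but it works on the example. Note that it only looks at Collector and compares words position by position (first word with first word, and so on):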

    dd <- data.frame(Collector = c('Ruiz Galvis Marta', 'Smith et al Marta', 'Ruiz Andres Allan'),
                     stringsAsFactors = FALSE)
    
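    ## split each name into words and pad to a common length with NA,
    ## giving a matrix with one row per name and one column per word position
    tt <- strsplit(dd$Collector, '\\s+')
    mm <- do.call('rbind', lapply(tt, `length<-`, max(lengths(tt))))
    
    ## flag words that occur more than once in a column; the NA padding is
    ## excluded so that short names do not count as matching each other
    dup <- apply(mm, 2, function(x)
      !is.na(x) & (duplicated(x) | duplicated(x, fromLast = TRUE)))
    
    ## keep only the rows with no flagged word
    dd[rowSums(dup) == 0, ]
    
    # [1] "Smith et al Marta"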
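
The question also asks that "sample" and "number" match before a row is dropped, and that any shared word counts, not only words in the same position. A minimal sketch of one way to do that in base R follows. The column names are taken from the question's table; the helper names (words, long, key, shared, result) are made up for illustration, and the sketch assumes Collector contains no missing values.

```r
data3 <- data.frame(
  Collector = c("Ruiz Galvis Marta", "Smith et al Marta", "Ruiz Andres Allan"),
  Times     = c(9, 8, 4),
  sample    = c("SP.1", "SP.2", "SP.1"),
  number    = c("one", "two", "one"),
  stringsAsFactors = FALSE
)

# Long format: one row per (row id, word), carrying sample and number along.
# unique() drops a word repeated inside the same name, so a row cannot
# match itself.
words <- strsplit(data3$Collector, "\\s+")
long <- unique(data.frame(
  id     = rep(seq_len(nrow(data3)), lengths(words)),
  word   = unlist(words),
  sample = rep(data3$sample, lengths(words)),
  number = rep(data3$number, lengths(words)),
  stringsAsFactors = FALSE
))

# A (word, sample, number) combination that appears under more than one id
# means those rows share a word and agree on sample and number.
key    <- long[c("word", "sample", "number")]
shared <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Drop every row that took part in at least one such match.
result <- data3[!seq_len(nrow(data3)) %in% long$id[shared], ]
result
```

On the example data this keeps only the "Smith et al Marta" row: it shares "Marta" with the first row, but their sample and number values differ, while the two "Ruiz" rows agree on both and are removed. Because the work is done by duplicated() on one long data frame, no row-by-row comparison is needed.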