I have a data frame (data3) with over 20,000 rows and a column named "Collector". Each value in this column is a string of words, for example "Ruiz Galvis Marta". I need to compare each row with every other row in the data frame and delete the rows in which one or more words in data3$Collector match a word in the same column of another row, and the values in the columns "sample" and "number" match as well. That is:
INPUT:
Collector          Times  sample  number
Ruiz Galvis Marta  9      SP.1    one
Smith et al Marta  8      SP.2    two
Ruiz Andres Allan  4      SP.1    one
EXPECTED OUTPUT:
Collector          Times  sample  number
Smith et al Marta  8      SP.2    two
Thanks for any help!
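This is probably going to be slow as hell on 20,000 rows, and it only compares words that sit in the same position and ignores the sample and number columns, but it works on the example: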
dd <- data.frame(Collector = c('Ruiz Galvis Marta', 'Smith et al Marta', 'Ruiz Andres Allan'),
                 stringsAsFactors = FALSE)
## create a matrix with the words by column (shorter rows are padded with NA)
tt <- strsplit(dd$Collector, '\\s+')
mm <- do.call('rbind', lapply(tt, `length<-`, max(lengths(tt))))
## remove every row that has a duplicated value in any column of the matrix
dd[rowSums(apply(mm, 2, function(x)
  duplicated(x) | duplicated(x, fromLast = TRUE))) == 0, ]
# [1] "Smith et al Marta"
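If a word should count as a match in any position, and the sample and number columns have to agree too, something along these lines might work. It is only a sketch under a few assumptions: the full data frame is called data3 as in the question, words are compared exactly (case-sensitive), and shares_word is a made-up helper name, not a base R function.

```r
# Example data from the question
data3 <- data.frame(
  Collector = c("Ruiz Galvis Marta", "Smith et al Marta", "Ruiz Andres Allan"),
  Times     = c(9, 8, 4),
  sample    = c("SP.1", "SP.2", "SP.1"),
  number    = c("one", "two", "one"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for a vector of collector strings, return TRUE for
# every element that shares at least one word with another element
shares_word <- function(collector) {
  # split into words and keep each word once per row
  words  <- lapply(strsplit(trimws(collector), "\\s+"), unique)
  # one entry per (row, word) pair
  row_id <- rep(seq_along(words), lengths(words))
  word   <- unlist(words)
  # words are unique within a row, so a duplicate here means
  # the word occurs in more than one row
  shared <- !is.na(word) & word %in% word[duplicated(word)]
  seq_along(collector) %in% row_id[shared]
}

# row indices grouped by the sample/number combination
groups <- split(seq_len(nrow(data3)),
                list(data3$sample, data3$number), drop = TRUE)

# flag rows that share a word with another row of the same group
drop_row <- logical(nrow(data3))
for (idx in groups) {
  drop_row[idx] <- shares_word(data3$Collector[idx])
}

result <- data3[!drop_row, ]
result
```

On the example, the two "Ruiz" rows fall in the same SP.1/one group and are both dropped, while the "Smith" row is alone in its group and is kept even though "Marta" also appears in another row. Because the comparison only happens inside each sample/number group, this should avoid comparing all 20,000 rows against each other.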