Search code examples
rstringoptimizationgrepl

R There is a package that reduce/filter multiple strings to the shortest string? Optimization of script


I searched in forums and manuals about a script to "filter strings without deleting the string when it's alone". I have a script that works but i want to know if there is a way to make it faster.

What i did is:

library(readr)
dictio<-read_csv("Diccio.csv")
filter<-read_csv("Diccio.csv")

#Dictio and filters are the same df and look like this

| words           | 
| ----------------| 
| methyl          | 
| propyl          | 
| isopentil-oxane | 
| methylated-oxane| 
| isopropylan     | 

for (h in seq_along(filter$words)){
  for (k in seq_along(dictio$words)){
      if(grepl(filter$words[h], dictio$words[k], fixed = TRUE)==TRUE & 
      str_length(dictio$words[k])!=str_length(filter$words[h])){
      dictio$words[k]<-NA}
rm(h,k,filter)

# I use FIXED=TRUE because grepl give an error if the string to check > string im checking
# str_length its to avoid deletion of the perfect match of the string

dictio <- na.omit(dictio)

Then dictio look like this:

| words           | 
| ----------------| 
| methyl          | 
| propyl          | 
| isopentil-oxane | 

The last 2 rows string match the 2 first rows (no exact perfect match). Then, there is a package in R or a way to do this faster? Right now im working with a few thousand of rows, but i will need to work with more.


Solution

  • I am not aware of a package, but it should be fairly fast, even in base R. One way to speed up the computation would be to not check words that have already been dropped. Sorting the words might also help: only shorter words can make longer words redundant, not the order way around; so start with the short words.

    Here is a sketch.

    words <- c("methyl",
               "propyl",
               "isopentil-oxane",
               "methylated-oxane",
               "isopropylan")
    
    
    ## drop duplicates
    words <- unique(words)
    
    ## sort by word-length, then alphabet
    words <- words[order(nchar(words), words)]
    
    dropped <- logical(length(words))
    one_to_n <- seq_along(words)
    
    for (i in one_to_n) {
    
        ## which words to check: those that are
        ## longer/later and have not been dropped yet
        to_check <- i < one_to_n & !dropped
    
        ## drop words in which the current
        ## word is contained
        dropped[to_check][grepl(words[i], words[to_check], fixed = TRUE)] <- TRUE
    }
    

    Finally:

    words[!dropped]
    ## [1] "methyl"          "propyl"          "isopentil-oxane"