Writing a user-defined function that accepts an oldstring, searches a dataframe column, and replaces with a newstring


I have a dataset blah with a column kw. It holds tens of thousands of strings, some of which are sentence-length. I already replaced the vast majority of what I want to replace with a for loop, mapping substrings to substring categories. However, I cannot possibly think of every substring that needs replacing. Most of the heavy lifting is done, but there are still a fair number of edge cases, and I want to handle them as they arise.

I want to create a function cleanup that takes an oldstring and a newstring and replaces every instance of oldstring in blah$kw with newstring.

Here's what I've written so far:

cleanup <- function(oldstring, 
                    newstring) {
           blah$kw[grepl(oldstring, 
                         blah$kw)] <- sapply(blah$kw[grepl(oldstring, 
                                                           blah$kw)],
                                             function(x) gsub(oldstring,
                                                              newstring, 
                                                              x))
}

This may look stupid, I have no idea; I'm quite new to R. I based it on some one-off code I found, which is here:

blah$kw[grepl("oldstring",
              blah$kw)] <- sapply(blah$kw[grepl("oldstring",
                                                 blah$kw)],
                                  function(x) gsub("oldstring",
                                                   "newstring",
                                                   x))

The one-off version works like a charm. Anyway, any help would be huge. Thanks!


Solution

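  • It's generally best practice to pass the data to the function as an argument rather than hardcode the data set inside it. Your version also has no visible effect: the assignment to blah$kw inside the function modifies a local copy, which is discarded when the function returns. Instead, have the function return the modified vector and assign the result back. This can be accomplished via subsetting: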

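    cleanup <- function(df1, oldstring, newstring) {
      # df1 is the character vector to clean, e.g. blah$kw
      hits <- grepl(oldstring, df1)                       # elements containing oldstring
      df1[hits] <- gsub(oldstring, newstring, df1[hits])  # replace only in those elements
      df1                                                 # return the modified copy
    }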
    
    blah$kw <- cleanup(blah$kw, "a", "y")
    

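    Note: this will not work if your strings are stored as factors, because assigning a value that is not an existing factor level produces NA with a warning. Convert the column with as.character() first.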
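
To illustrate the factor note, here is a minimal sketch of a variant that converts factors before replacing. The name cleanup2 and the fixed argument are hypothetical additions, not part of the answer above. It relies on the fact that gsub() leaves non-matching elements unchanged, so the grepl() subsetting can be dropped. Since oldstring is treated as a regular expression by default, the optional fixed = TRUE may help when an edge case contains characters such as "." that should be matched literally.

```r
# Hypothetical variant of cleanup(); the name cleanup2 and the `fixed`
# argument are illustrative assumptions, not part of the original answer.
cleanup2 <- function(x, oldstring, newstring, fixed = FALSE) {
  # Work on the character labels so new values are never invalid factor levels.
  if (is.factor(x)) x <- as.character(x)
  # gsub() returns non-matching elements unchanged, so no subsetting is needed.
  # fixed = TRUE treats oldstring as literal text instead of a regex.
  gsub(oldstring, newstring, x, fixed = fixed)
}

# Toy data standing in for the real blah$kw column (made up for this example).
blah <- data.frame(kw = factor(c("cat food", "dog food", "a.b test", "axb test")))

blah$kw <- cleanup2(blah$kw, "food", "chow")
blah$kw <- cleanup2(blah$kw, "a.b", "dot", fixed = TRUE)
# blah$kw is now: "cat chow" "dog chow" "dot test" "axb test"
```

Without fixed = TRUE, the pattern "a.b" would also match "axb", because "." matches any character in a regular expression.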