
gsub Find and replace text between two strings in R


I have to use gsub on 1000 rows of a column containing text. In every row I want to remove every word occurring between "said:" and "click to expand...", as they are just a copy of the previous tweet. I am trying to use gsub to accomplish this:

content2<-as.data.frame(gsub(".*said:(.*?)expand.... *", " ", content2$txt,fixed=TRUE),stringsAsFactors = FALSE);

But it is deleting only "said:" and "expand". content2 is a data frame with 100 observations of 1 variable, and I have to do this for every row. After Wiktor's response I checked whether the line he wrote works. I can still see "said:" and "click to expand...." in row 35, so I guess his code only works for the first row (which doesn't contain the text to be deleted anyway). I tried to use apply to run this on every row as follows, without success; besides being too slow, it gives me an error:

ops<-apply(content2,1,gsub("(said:).*?(click to expand\\.{3})", "\\1 \\2", content2,fixed=TRUE))

I just looked through the duplicate post; it doesn't answer my question, which is: what should I do if I want to replace all characters between two patterns, say everything between "said:" and "click to expand", for all rows of a 100×1 data frame? Every row contains text, and the output should be a data frame of dimensions 100×1:

ops<-gsub("(said:).*?(click to expand\\.{3})", "\\1 \\2", test)

@WiktorStribiżew Thanks, it seems to work. The only issue is that I also want to remove "said:" and "click to expand...". I made the following reproducible example; you can see that "said:" and "click to expand..." are not removed.

test<-as.data.frame(c("he said: i wanna be a rockstar click to expand....ok great but how you gonna do it", 
                      "rockstar said: so how you gonna do it click to expand.... we are wanna be a big rockstar, hang out in collest bar vip with movie star"),stringsAsFactors=FALSE)
ops<-lapply(test, gsub, pattern = '(said:).*?(click to expand\\.{3})', replacement ="\\1 \\2", perl=TRUE)
ops<-as.data.frame(ops,stringsAsFactors = FALSE)

Solution

  • To remove all substrings spanning from said: up to the nearest following click to expand... in all the columns of a data frame, you may use

    content2[] <- lapply(content2, gsub, pattern = '(?s)said:.*?click to expand\\.{3}', replacement =" ", perl=TRUE)
    

    The PCRE regex (note that perl=TRUE enables the PCRE engine) matches:

    • (?s) - enables . to match line break chars (it does not by default)
    • said: - a string (to match it as a whole word add \b in front)
    • .*? - any 0+ chars, as few as possible
    • click to expand\.{3} - click to expand... substring (\.{3} matches a . char thrice).
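
As a quick check, the approach above can be tried on data shaped like the reproducible example from the question. This is only a sketch: the column name txt and the object names pattern, cleaned and literal are assumptions made for illustration. It also shows why the original attempts with fixed=TRUE could not work, since that argument makes gsub treat the pattern as a literal string rather than a regex.

```r
# Sample data adapted from the question; the column name `txt` is an assumption.
test <- data.frame(
  txt = c(
    "he said: i wanna be a rockstar click to expand....ok great but how you gonna do it",
    "rockstar said: so how you gonna do it click to expand.... we are wanna be a big rockstar"
  ),
  stringsAsFactors = FALSE
)

# (?s) lets . match line breaks; .*? stops at the first "click to expand..." after "said:".
pattern <- "(?s)said:.*?click to expand\\.{3}"

# Assigning into cleaned[] keeps the data frame structure (same rows and columns).
cleaned <- test
cleaned[] <- lapply(test, gsub, pattern = pattern, replacement = " ", perl = TRUE)

# With fixed = TRUE the pattern is searched for literally, so nothing is replaced.
literal <- gsub(pattern, " ", test$txt, fixed = TRUE)

print(cleaned$txt)
print(identical(literal, test$txt))
```

Note that the sample strings end the marker with four dots, while \\.{3} consumes exactly three, so one dot is left behind in each row. If that matters, replacing \\.{3} with \\.{3,} would consume any longer run of dots as well; that variant is a suggestion here, not part of the original answer.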