
Data preprocessing in R: removing a duplicated number in a string


I am doing data preprocessing and am stuck on a problem. I have data like "Telma 2525 mg tablet" and I want to convert it to "Telma 25 mg tablet". Can this be done?

Thanks


Solution

  • gsub()

    > x<-rep("Telma 2525 mg tablet",10)
    > x
    [1] "Telma 2525 mg tablet" "Telma 2525 mg tablet" "Telma 2525 mg tablet" "Telma 2525 mg tablet" "Telma 2525 mg tablet"
    [6] "Telma 2525 mg tablet" "Telma 2525 mg tablet" "Telma 2525 mg tablet" "Telma 2525 mg tablet" "Telma 2525 mg tablet"
    
    > gsub("Telma 2525 mg tablet","Telma 25 mg tablet",x)
    
    [1] "Telma 25 mg tablet" "Telma 25 mg tablet" "Telma 25 mg tablet" "Telma 25 mg tablet" "Telma 25 mg tablet"
    [6] "Telma 25 mg tablet" "Telma 25 mg tablet" "Telma 25 mg tablet" "Telma 25 mg tablet" "Telma 25 mg tablet"
    

    where x is the character vector holding your data.

    EDIT - UPDATED TO MAKE IT GENERIC

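    d <- data.frame(t = c("blah blah 2525 mg", "blah blah 7272 mg"), stringsAsFactors = FALSE)
    
    remdup <- function(s) {
      f <- regexec("[0-9]{4}", s)[[1]][1]  # start position of the first run of 4 digits
      if (f == -1) return(s)               # no 4-digit run: leave the string unchanged
      # drop the first 2 digits of that run by position rather than by pattern,
      # so the same 2 digits appearing earlier in the string are not removed instead
      paste0(substr(s, 1, f - 1), substr(s, f + 2, nchar(s)))
    }
    
    lapply(d$t, remdup)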
    
    #[[1]]
    #[1] "blah blah 25 mg"
    #  
    #[[2]]
    #[1] "blah blah 72 mg"
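
If the doubled number is not always 4 digits long, a regular expression with a backreference may be a simpler route. This is only a minimal sketch: it assumes that any whole number made of two or more digits immediately repeated (2525, 250250) is an accidental duplicate, and `collapse_doubled_number` is a hypothetical helper name, not something from the answer above.

```r
# Hypothetical helper: collapse a whole number that consists of a run of
# 2+ digits written twice in a row (e.g. "2525" -> "25").
collapse_doubled_number <- function(x) {
  # \\b(\\d{2,})\\1\\b : word boundary, a captured run of 2+ digits,
  # the same run again (backreference \\1), then a word boundary
  gsub("\\b(\\d{2,})\\1\\b", "\\1", x, perl = TRUE)
}

x <- c("Telma 2525 mg tablet", "blah blah 7272 mg", "Telma 40 mg tablet")
collapse_doubled_number(x)
#[1] "Telma 25 mg tablet" "blah blah 72 mg"    "Telma 40 mg tablet"
```

gsub() is vectorized, so this works directly on a data frame column without lapply(). Note the trade-off: a genuine value such as 1010 would also be collapsed, and a doubled single digit (55 for 5) is deliberately left alone so that real two-digit values like 11 are not changed.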