text cleaning in R


I have a single column in R that looks like this:

Path Column
ag.1.4->ao.5.5->iv.9.12->ag.4.35
ao.11.234->iv.345.455.1.2->ag.9.531

I want to transform this into:

Path Column
ag->ao->iv->ag
ao->iv->ag

How can I do this?

Thank you

Here is the full dput() output of my data:

structure(list(Rank = c(10394749L, 36749879L), Count = c(1L, 
1L), Percent = c(0.001011122, 0.001011122), Path = c("ao.legacy payment.not_completed->ao.legacy payment.not_completed->ao.legacy payment.completed", 
"ao.legacy payment.not_completed->agent.payment.completed")), .Names = c("Rank", 
"Count", "Percent", "Path"), class = "data.frame", row.names = c(NA, 
-2L))

Solution

  • You could use gsub to match a literal . followed by one or more digits (\\.[0-9]+) and replace each match with the empty string ''.

     df1$Path.Column <- gsub('\\.[0-9]+', '', df1$Path.Column)
     df1
     #           Path.Column
     #1 ag -> ao -> iv -> ag
     #2       ao -> iv -> ag
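
As a quick check of what the digit-only pattern does and does not cover, the sketch below applies it to one string from the question and one from the dput. The vector names nums and txt are made up for illustration.

```r
# Hypothetical example vectors; the names are for illustration only.
nums <- "ag.1.4->ao.5.5->iv.9.12->ag.4.35"
txt  <- "ao.legacy payment.not_completed->agent.payment.completed"

# Every ".<digits>" run is removed, including repeated ones such as ".1.4"
res_nums <- gsub("\\.[0-9]+", "", nums)

# No dot in txt is followed by a digit, so nothing is replaced
res_txt <- gsub("\\.[0-9]+", "", txt)

print(res_nums)
print(identical(res_txt, txt))
```

So the pattern is enough when the suffixes are purely numeric, but it leaves text suffixes such as .legacy payment untouched.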
    

    Update

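    For the new dataset df2, the text after each . is not numeric, so instead match a . followed by everything that is not - or >, up to the next -> or a word boundary (such as the end of the string):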

    gsub('\\.[^->]+(?=(->|\\b))', '', df2$Path, perl=TRUE)
    #[1] "ao->ao->ao" "ao->agent" 
    

    The same pattern also works for the strings shown in the original post:

    str2 <- c('ag.1.4->ao.5.5->iv.9.12->ag.4.35',
        'ao.11.234->iv.345.455.1.2->ag.9.531')
    
    gsub('\\.[^->]+(?=(->|\\b))', '', str2, perl=TRUE)
     #[1] "ag->ao->iv->ag" "ao->iv->ag"    
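
One caveat: \\b also matches just before a hyphen, so the lookahead pattern would stop early if a segment itself contained a hyphen (for example a hypothetical value like ao.not-completed). A possible alternative, sketched here on the assumption that the separator is always exactly -> with no surrounding spaces, is to split on the separator, keep the text before the first . in each piece, and paste the pieces back together. The helper name strip_suffix is made up for this example.

```r
# Hypothetical helper: keep only the part of each "->"-separated
# segment that comes before its first dot.
# Assumes the separator is exactly "->" with no surrounding spaces.
strip_suffix <- function(x) {
  parts <- strsplit(x, "->", fixed = TRUE)
  vapply(
    parts,
    function(p) paste(sub("\\..*$", "", p), collapse = "->"),
    character(1)
  )
}

str2 <- c("ag.1.4->ao.5.5->iv.9.12->ag.4.35",
          "ao.11.234->iv.345.455.1.2->ag.9.531")
out_str2 <- strip_suffix(str2)

# A made-up input with a hyphen inside a segment
out_hyphen <- strip_suffix("ao.not-completed->agent.payment.completed")

print(out_str2)
print(out_hyphen)
```

This is slower than a single gsub on very large vectors, but it does not depend on which characters appear inside a segment.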
    

    data

    df1 <- structure(list(Path.Column = c("ag.1 -> ao.5 -> iv.9 -> ag.4", 
    "ao.11 -> iv.345 -> ag.9")), .Names = "Path.Column", 
    class = "data.frame", row.names = c(NA, -2L))
    
    df2  <- structure(list(Rank = c(10394749L, 36749879L), Count = c(1L, 
    1L), Percent = c(0.001011122, 0.001011122), 
    Path = c("ao.legacy payment.not_completed->ao.legacy payment.not_completed->ao.legacy payment.completed", 
    "ao.legacy payment.not_completed->agent.payment.completed")), 
    .Names = c("Rank", "Count", "Percent", "Path"), class = "data.frame", 
    row.names = c(NA, -2L))