Search code examples
rsubstringsapply

R - finding pattern in a column and replacing it (more efficient solution)


I have a large data set of passengers per route similar to the following:

routes <- c("MEX-GDL", "ACA-MEX", "CUN-MTY", "MTY-CUN", "GDL-MEX", "MEX-ACA")
pax <- sample(100:500, size = 6, replace = T)
traffic <- data.frame(routes = routes, pax = pax)

   routes pax
   1 MEX-GDL 282
   2 ACA-MEX 428
   3 CUN-MTY 350
   4 MTY-CUN 412
   5 GDL-MEX 474
   6 MEX-ACA 263

I want to group flights if the origin and destination match as to get the total number of passengers in the route - so for example renaming the route MEX-GDL as GDL-MEX or viceversa so I can then use group_by() on the data set.

Kind of like this:

traffic %>% group_by(routes) %>% summarise(sum(pax)) 

I have done the following and it works, but I believe there can be a more efficient way to solve the problem (as it takes quite some time to run):

library(tidyverse)

traffic$routes <- as.character(traffic$routes)

for(route in traffic$routes){
  a <- substring(route, first = 1, last = 3) 
  b <- substring(route, first = 5, last = 7)
  aux <- which(sapply(traffic$routes, str_detect, pattern = paste0(b,"-",a)))
  traffic$routes[aux] <- paste0(a,"-",b)
}

Any suggestions?

Thanks for the help!

Note: it's my first question here, so I hope I complied with all the guidelines.


Solution

  • We can separate into two columns, grouped by the pmax or pmin, get the sum

    library(tidyverse)
    traffic %>% 
       separate(routes, into = c("Col1", "Col2")) %>%
       group_by(ColN = pmin(Col1, Col2), ColN2 = pmax(Col1, Col2)) %>% 
       summarise(Sum = sum(pax))