Search code examples
rperformanceaggregatedata-cleaning

Aggregating binary columns into one column is taking a long time in R


I am running a piece of code that was discussed earlier here on a large data with 1.5 million rows and it's been taking hours to run and not done yet. My data looks like this:

ID    London   Paris   Rome
1       Yes     No      Yes
2       No      No      Yes
3       No      Yes     Yes
4       No      Yes     No

and I'd like to add a column that shows all cities that an ID has travelled to as well as a column showing the number of cities that an ID has travelled to as the following:

ID    London   Paris   Rome    All Cities      Count of Cities travelled
1       Yes     No      Yes    London, Rome                2
2       No      No      Yes     Rome                       1
3       No      Yes     Yes    Paris, Rome                 2
4       No      Yes     No     Paris                       1

and I am running this code that works fine when I run it on a sample of 100 rows of data :

cities <- c('London', 'Paris', 'Rome')

df %>%
  rowwise %>%
  mutate(`All Cities` = toString(names(.[, cities])[which(c_across(all_of(cities)) == 'Yes')]),
         `Count of Cities travelled` = sum(c_across(all_of(cities)) == 'Yes'))

is there any way to improve this code? or make the running time shorter?

Thank you!


Solution

  • Here is a tidyverse approach without using rowwise(), which is known to be terribly slow.

    library(tidyverse)
    
    cities <- c('London', 'Paris', 'Rome')
    df <- read.table(header = T, text = "ID    London   Paris   Rome
    1       Yes     No      Yes
    2       No      No      Yes
    3       No      Yes     Yes
    4       No      Yes     No")
    
    df %>% 
      mutate(across(cities, ~ifelse(.x == "Yes", cur_column(), NA), .names = "{.col}1")) %>% 
      unite(`All Cities`, ends_with("1"), sep = ", ", na.rm = T) %>% 
      mutate(`Count of Cities travelled` = str_count(`All Cities`, ",") + 1)
    
      ID London Paris Rome   All Cities Count of Cities travelled
    1  1    Yes    No  Yes London, Rome                         2
    2  2     No    No  Yes         Rome                         1
    3  3     No   Yes  Yes  Paris, Rome                         2
    4  4     No   Yes   No        Paris                         1