I am running a piece of code that was discussed earlier here on a large data with 1.5 million rows and it's been taking hours to run and not done yet. My data looks like this:
ID London Paris Rome
1 Yes No Yes
2 No No Yes
3 No Yes Yes
4 No Yes No
and I'd like to add a column that shows all cities that an ID has travelled to as well as a column showing the number of cities that an ID has travelled to as the following:
ID London Paris Rome All Cities Count of Cities travelled
1 Yes No Yes London, Rome 2
2 No No Yes Rome 1
3 No Yes Yes Paris, Rome 2
4 No Yes No Paris 1
and I am running this code that works fine when I run it on a sample of 100 rows of data :
cities <- c('London', 'Paris', 'Rome')
df %>%
rowwise %>%
mutate(`All Cities` = toString(names(.[, cities])[which(c_across(all_of(cities)) == 'Yes')]),
`Count of Cities travelled` = sum(c_across(all_of(cities)) == 'Yes'))
is there any way to improve this code? or make the running time shorter?
Thank you!
Here is a tidyverse
approach without using rowwise()
, which is known to be terribly slow.
library(tidyverse)
cities <- c('London', 'Paris', 'Rome')
df <- read.table(header = T, text = "ID London Paris Rome
1 Yes No Yes
2 No No Yes
3 No Yes Yes
4 No Yes No")
df %>%
mutate(across(cities, ~ifelse(.x == "Yes", cur_column(), NA), .names = "{.col}1")) %>%
unite(`All Cities`, ends_with("1"), sep = ", ", na.rm = T) %>%
mutate(`Count of Cities travelled` = str_count(`All Cities`, ",") + 1)
ID London Paris Rome All Cities Count of Cities travelled
1 1 Yes No Yes London, Rome 2
2 2 No No Yes Rome 1
3 3 No Yes Yes Paris, Rome 2
4 4 No Yes No Paris 1