Tags: r, sqlite, split, large-data

How to speed up splitting a column into a thousand columns?


I need to analyse a large dataset. To do so, I need to split a character variable into more than a thousand columns. The variable has the structure `number$number$number$...`, repeated for a thousand numbers.

My data is stored in an SQLite database. I imported it into R using RSQLite. I tried splitting this column into multiple columns using tidyr's `separate()`:

#d is a data.table with my data

d2 <- d %>% separate(column_to_separate, into = paste0("S", 1:number_of_final_columns))

It works, but it takes forever. How can I split this column faster (either in R or in SQLite)?
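For context, a minimal RSQLite import looks roughly like the sketch below. It uses an in-memory database so it runs standalone; in practice the connection string would be the path to the `.sqlite` file, and `mytable` is a placeholder for the real table name:

```r
library(DBI)

# ":memory:" is a self-contained demo; replace with the path to your
# .sqlite file. "mytable" is a hypothetical table name.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mytable", data.frame(x = 1:3))

# Pull the whole table into an R data.frame
d <- dbGetQuery(con, "SELECT * FROM mytable")
dbDisconnect(con)
```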


Solution

  • You may use the tidyfast package (see here), which builds on data.table. In this test it is roughly three times faster:

    test <- data.frame(
      long.var = rep(paste0("V", 1:1000, "$", collapse = ""), 1000)
    )
    system.time({
      test |> 
        tidyr::separate(long.var, into = paste0("N", 1:1001), sep="\\$")
    })
    #>    user  system elapsed 
    #>   0.352   0.012   0.365
    
    system.time({
      test |> 
        tidyfast::dt_separate(long.var, into = paste0("N", 1:1001), sep="\\$")
    })
    #>    user  system elapsed 
    #>   0.117   0.000   0.118
    

    Created on 2023-02-03 with reprex v2.0.2
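    If you prefer to avoid an extra dependency, data.table's own `tstrsplit()` performs the same kind of split directly. A minimal sketch on the same shape of data (note that with `fixed = TRUE` the trailing `"$"` does not produce an empty last column, so only 1000 columns result):

    ```r
    library(data.table)

    # Same shape of data as in the timing test above
    test <- data.table(
      long.var = rep(paste0("V", 1:1000, "$", collapse = ""), 1000)
    )

    # tstrsplit() splits each string and transposes the pieces into columns;
    # fixed = TRUE treats "$" as a literal separator, not a regex
    test[, paste0("N", 1:1000) := tstrsplit(long.var, "$", fixed = TRUE)]
    test[, long.var := NULL]
    ```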