Search code examples
rdplyrtibble

Indexing data frame takes too long


I've got some code looking like this:

library(stringi)

df_values <- data.frame(value = stri_rand_strings(n = 500,
                                                  length = 30))

df_keys <- tibble(key = sample(x = 1:500,
                               size = 25000,
                               replace = TRUE))

# start timer
start_time <- Sys.time()

df_keys |>
 rowwise() |>
 mutate(value = df_values$value[key])

# end timer
end_time <- Sys.time()

end_time - start_time

Which requires very much time to run, but I can't figure out why. The code above only requires 0.3003931 seconds. For my real code I subsetted the tibble with head(n) and got following times:

n time in secs
50 1.993536
100 3.731
200 6.550074
300 9.500864
500 15.68515
1,000 32.19306
... seems to be linear
20,000 maybe 10 minutes

Does someone have an idea what could be wrong with my code? I guess it's the indexing-part df_values$value[key]? But my original df_values also is a data.frame with 500 obs.


Solution

  • A possible solution, in base R. As we can see, the execution time takes only 1% of the time, compared to your dplyr approach. Even removing rowwise, the execution time is extremely faster with a base R approach.

    library(tidyverse)
    library(stringi)
    
    # start timer
    start_time <- Sys.time()
    
    df_keys |>
      rowwise() |>
      mutate(value = df_values$value[key])
    #> # A tibble: 25,000 × 2
    #> # Rowwise: 
    #>      key value                         
    #>    <int> <chr>                         
    #>  1   287 BeFLZsuRxlKJAJLgOnH1SO2f6kjpPH
    #>  2   292 yG1JoxKRzSDnBlk4fJKDcKwzAUGwOy
    #>  3   334 38pJ1h3RaTTSDgcf7gyCuW2NqFyncZ
    #>  4   120 LqqCmTiMQV50hV0c0yYzk94AtpV7I6
    #>  5   233 62BsX6NAEQqYx5wjm5ienCYgDmvJDb
    #>  6   413 OB2MqTt1SOTb3irKlLEBtr4MfvuWW5
    #>  7   123 4IKKUTli7c1l8GwU8TTpWHLHirGCy8
    #>  8   400 aDnB9PwIKQkdfAW5kwzM215vU9aCNk
    #>  9   214 aOsJkVENbncaHESiU2rwmfXqY5yVsK
    #> 10   332 v4DfYVOr9kedtIwnWFlefDfFhHJ25R
    #> # … with 24,990 more rows
    
    # end timer
    end_time <- Sys.time()
    
    end_time - start_time
    
    #> Time difference of 0.1876147 secs
    
    start_time <- Sys.time()
    df_keys$value <- df_values$value[df_keys$key]
    end_time <- Sys.time()
    
    end_time - start_time
    
    #> Time difference of 0.002212286 secs