Tags: r, dplyr, tidyverse, summarize

Is there a more efficient way to obtain the variance of lots of columns than dplyr?


I have a data.frame with >250,000 columns and 200 rows, so around 50 million individual values. I am trying to get a breakdown of the variance of the columns in order to select the columns with the most variance.

I am using dplyr as follows:

library(dplyr)

df %>% summarise_if(is.numeric, var)

It has been running on my iMac with 16 GB of RAM for about 8 hours now.

Is there a way to allocate more resources to the call, or a more efficient way to summarise the variance across columns?
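
For reference, a toy data.frame of roughly this shape can be built as follows (random values, not my real data; shrink n_cols for quick experiments):

set.seed(1)
n_rows <- 200
n_cols <- 250000   # reduce this for quick tests
df <- as.data.frame(matrix(rnorm(n_rows * n_cols), nrow = n_rows))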


Solution

  • You may try using data.table, which is usually faster.

    library(data.table)
    
    # Pick out the numeric columns, convert df to a data.table by reference,
    # then compute the variance of each of those columns in one pass
    cols <- names(Filter(is.numeric, df))
    setDT(df)
    df[, lapply(.SD, var), .SDcols = cols]
    

    Another approach you can try is to get the data into long format.

    library(dplyr)
    library(tidyr)
    
    df %>%
      select(where(is.numeric)) %>%         # keep only the numeric columns
      pivot_longer(cols = everything()) %>%  # stack into name/value pairs
      group_by(name) %>%
      summarise(var_value = var(value))      # variance per original column
    

    but I agree with @Daniel V that it is worth checking the data, as 8 hours is far too long for this calculation. Once you have the per-column variances from either approach, you can rank them to pull out the columns you need, as in the sketch below.
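
    Whichever approach computes the variances, the original goal was to keep the columns with the most variance. A minimal sketch of that last step, assuming the data.table result above and an arbitrary cutoff of the top 1,000 columns (the names var_dt, col_var, top_cols are just for illustration):

    var_dt   <- df[, lapply(.SD, var), .SDcols = cols]
    col_var  <- unlist(var_dt)                 # named vector of per-column variances
    top_cols <- names(sort(col_var, decreasing = TRUE))[1:1000]  # arbitrary cutoff
    df_top   <- df[, ..top_cols]               # data.table syntax for selecting columns by name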