Search code examples
rdataframelisttidyversedimensions

Select the data.frame with maximum dimensions from a list of data.frames


I have a list of data.frames. Each data.frame within the list has different dimensions (different number of rows and columns). I would list to select the data.frame from the list that has the most rows AND the most columns (i.e. the biggest dimensions overall).

I would like a tidyverse solution and in particular am trying to use purrr::keep() - however I am getting stuck as I can't figure out how to make the index corresponding to the maximum dimensions into a logical condition.

I can do it in two steps, but it seems a bit repetitive and wondering if there is any way to do this in a single line, all within the purrr::keep() function. Speed/efficiency is important too as in my real data there would be almost a hundred different data.frames to select from.

Here is what I tried (2-step version):


# First get the size of the data.frame with the biggest dimensions in the list:
maxdim = 1:length(dflist) %>% 
  map_vec(~ reduce(dim(dflist[[.x]]), `*`)) %>% 
  max()

# Now select the data.frame from the list that matches maxdim:
dfinal = purrr::keep(dflist, ~ reduce(dim(.), `*`) == maxdim)[[1]]

Note I had to use reduce() with dim() as I couldn't find a function that would give me the dimensions of a data.frame as a single number (i.e. number of rows multiplied by number of columns). It seems reasonably fast but would also like to know if there is a dedicated function for this. I also tried length(unlist(df) but don't know if that is any faster.

Here is some code for making the example list of data.frames:

# Create example data:
df <- data.frame(
  id = c(1,2,3,4,5,6,7,8,9,10),
  c1 = c("a", "a", "c", NA, "c", "d", "c", NA, "a", "b"),
  c2 = c(25, NA, 17, 5, 50, 43, 21, 2, 1, NA), 
  c3 = c(NA, "s", "r", NA, "r", "i", NA, "r", NA, NA),
  c4 = c(1.0, 5.3, 2.9, NA, 6.1, NA, 2.5, 4.3, 9.1, 2.4),
  c5 = c(5, 6, NA, 3, 1, 6, 7, 8, 2, 1)
)

# Make a vector of columns to iteratively drop:
cols2drop <- c("c2", "c3", "c4", "c5")

# Create the list of data.frames (subsets of the original) of different sizes:
dflist = cols2drop %>%
    map(~ df %>% select(1:.x) %>% drop_na())


Solution

  • dflist[[which.max(map_dbl(dflist, ~prod(dim(.x))))]]
    
      id c1 c2
    1  1  a 25
    2  3  c 17
    3  5  c 50
    4  6  d 43
    5  7  c 21
    6  9  a  1
    

    You could invoke log and do summation instead of multiplication:

     dflist[[which.max(colSums(log(sapply(dflist, dim))))]]