
R: extract inner higher level combinations (groups of 1, 2, 3, and 4 elements) out of a data frame of combinations of 5 elements


Sorry, I have to post another question following up on this one and this other one.

While the answer to the second one addresses the MWE perfectly, in my real-world data I need to do things differently, and I wondered if someone could help.

So this time around, my starting point is a data frame (named plusminus_df) of combinations of 5 elements (in reality it can be 1 to n), of the following form:

markers <- LETTERS[1:5]
plusminus_df <- expand.grid(lapply(seq_along(markers), function(x) c("+", "-")),
                            stringsAsFactors = FALSE)
names(plusminus_df) <- markers
head(plusminus_df)

  A B C D E
1 + + + + +
2 - + + + +
3 + - + + +
4 - - + + +
5 + + - + +
6 - + - + +

So it is just a data frame of all +/- combinations for the 5 markers (note the number of markers is variable). What I need to do at this point is extract the inner higher-level combinations of 1, 2, 3, and 4 markers (again, variable numbers), preserving the same data frame structure (which means I need to fill with NAs).

So my expected result would be something like this:

> final_df
      A    B    C    D    E
1     + <NA> <NA> <NA> <NA>
2     - <NA> <NA> <NA> <NA>
3     +    - <NA> <NA> <NA>
4     -    - <NA> <NA> <NA>
5     +    + <NA> <NA> <NA>
6     -    + <NA> <NA> <NA>
7     +    -    - <NA> <NA>
8     -    -    - <NA> <NA>
9     +    +    + <NA> <NA>
10    -    +    + <NA> <NA>
11    +    -    + <NA> <NA>
12    -    -    + <NA> <NA>
13    +    +    - <NA> <NA>
14    -    +    - <NA> <NA>
15    +    -    -    - <NA>
16    -    -    -    - <NA>
17    +    +    +    + <NA>
...
n     +    +    +    +    +
n+1   -    +    +    +    +
n+2   +    -    +    +    +
n+3   -    -    +    +    +
n+4   +    +    -    +    +
n+5   -    +    -    +    +
...

With all the possible combinations of 1 marker (+ and -), 2 markers, 3, 4, and 5 (as in the original), filling in the unused markers with NA.
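(If I'm counting right, for 5 markers that should come to choose(5,1)*2 + choose(5,2)*4 + ... + choose(5,5)*32 = 3^5 - 1 = 242 rows in total.)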

So the answer to the second question works well for building this desired final data frame from scratch, from just the original markers vector. But in my real-world case I actually retrieve a filtered-down list of 5-marker combinations in the form of the plusminus_df above... What would be the most straightforward and efficient way to obtain the desired data frame from this one, without having to deal with messy nested loops?


Solution

  • I'm not completely certain I've understood what you're looking for, but from the second question it looks like you want all cross-combinations of columns within a data.frame.

    Disclaimer: The two answers already provided are more readable; here I focus on speed.

    As you are performing what is often known as a cross join (a Cartesian product of the column values), one aspect that quickly becomes a concern as n increases is efficiency. For efficiency it helps to split the problem into smaller sub-problems and solve each one. Since we need to find all unique combinations within the set of columns, including the null set (value = NA), we can reduce this problem to two sub-problems:

    1. Find the unique values for each column, including the null set.
    2. Expand this set to include all combinations of those values across columns.

    Using this idea we can quickly concoct a simple solution with expand.grid, unique, and lapply. The only tricky part is including the null set, but we can do this by selecting an NA row from the data.frame alongside all existing rows (a short demonstration follows the function version below).

    # Create null-set-included data.frame
    nullset_df <- plusminus_df[c(NA, seq_len(nrow(plusminus_df))), ]
    # Find all unique elements, including null set
    unique_df <- lapply(nullset_df, unique)
    # Combine all unique sets
    expand.grid(unique_df)
    

    or as a function

    nullgrid.expand <- function(df, ...)
      expand.grid(lapply(df[c(NA, seq_len(nrow(df))), ], unique), ...)
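
    To see why the c(NA, ...) trick works: subsetting a data.frame with an NA index yields a row consisting entirely of NAs, so prepending NA to the row indices injects the null set before unique is applied per column. A minimal sketch (the toy data.frame here is purely illustrative):

    toy <- data.frame(A = c("+", "-"), B = c("+", "+"), stringsAsFactors = FALSE)
    toy[c(NA, seq_len(nrow(toy))), ]                  # first row is all <NA>
    lapply(toy[c(NA, seq_len(nrow(toy))), ], unique)  # each column now includes NA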
    

    This is fairly fast (benchmarks and performance graphs later), but I wanted to go one step further. The data.table package is known for its high-performance functions, and one of those is CJ, for performing cross-joins. Below is one implementation using CJ:

    library(data.table)
    nullgrid.expand.dt <- function(df, ...)
      do.call(CJ, args = c(as.list(df[c(NA, seq_len(nrow(df))), ]),
                           sorted = FALSE,
                           unique = TRUE))
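
    As a quick sanity check (a sketch, not part of the benchmark): both implementations should produce the same set of rows. Note that CJ returns a data.table and row order may differ, so we coerce and compare as sets.

    res1 <- nullgrid.expand(plusminus_df)
    res2 <- as.data.frame(nullgrid.expand.dt(plusminus_df))
    nrow(res1) == nrow(res2)                               # both 3^5 = 243 rows
    setequal(do.call(paste, res1), do.call(paste, res2))   # same rows, any order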
    

    CJ requires vector input, forcing one to use do.call (or similar), which makes the function slightly less readable. But is there any performance gain? To test this, I ran a microbenchmark on the two functions above and on the ones provided by the existing answers (code below); the result is visualized in the violin plot below:

    [Figure: microbenchmark violin plot comparing the four methods]

    From the plot it is quite clear that @paul's answer outperforms @ekoam's, and that the two functions above outperform both provided answers in terms of speed. But the question says the input might have any number of columns, so there is also the question of how well each function scales with the number of columns and with the number of unique values per column (here we only have "+" and "-", but what if we had more?). For this I reran the benchmark for n_columns = 3, 4, ..., 10 and n_values = 2, 4, ..., 10. The two results are visualized with smooth curves below.
    First, we'll visualize time as a function of the number of columns. Note that the y axis is on a logarithmic scale (base 10) for easier comparison.

    [Figure: benchmark timing vs. number of columns]

    From the visualization it is quite clear that, as the number of columns increases, the choice of method becomes very important. The suggestion by @ekoam becomes very slow, primarily because it delays the call to unique until the very end. The remaining three methods are all much faster, and nullgrid.expand.dt becomes more than 10 times faster than the other methods once we get past 8 columns of data.

    Next, let's look at timing as a function of the number of unique values in each column (number of columns fixed at 5):

    [Figure: benchmark timing vs. number of unique values per column]

    Again we see a similar picture. Except for a single possible outlier for nullgrid.expand, which seems to become slower than @paul's answer as the number of unique values increases, nullgrid.expand.dt remains the fastest, although by the time we reach 10 unique values the margin shrinks to roughly (exp(4) - exp(3.6)) / exp(3.6) ≈ 49 %, i.e. about 1.5 times as fast as @paul's answer.

    Please note that I did not have enough RAM to run the benchmark for numbers of unique values or columns greater than those shown.

    Conclusion

    We've seen that there are many ways to reach the answer sought by the question, but as the number of columns and unique values grows, the choice of method becomes more and more important. By utilizing optimized libraries we can drastically reduce the time required to compute the cross-join of all column values, with only minimal effort. With extended effort using Rcpp we could likely reduce the runtime even further, though that is outside the scope of this answer.

    Benchmark code

    # Setup:
    set.seed(1234)
    library(tidyverse)
    library(data.table)
    library(microbenchmark)
    nullgrid.expand <- function(df, ...)
      expand.grid(lapply(df[c(NA, seq_len(nrow(df))), ], unique), ...)
    nullgrid.expand.dt <- function(df, ...)
      do.call(CJ, args = c(as.list(df[c(NA, seq_len(nrow(df))), ]),
                           sorted = FALSE,
                           unique = TRUE))
    markers <- LETTERS[1:5]
    plusminus_df <- expand.grid(lapply(seq_along(markers), function(x) c("+", "-")),
                                stringsAsFactors = FALSE)
    names(plusminus_df) <- markers
    
    bm <- microbenchmark(
      nullgrid.expand = nullgrid.expand(plusminus_df),
      nullgrid.expand.dt = nullgrid.expand.dt(plusminus_df),
      ekoam = unique(bind_rows(apply(
        plusminus_df, 1L, 
        function(r) head(expand.grid(lapply(r, c, NA_character_), stringsAsFactors = FALSE), -1L)
      ))),
      paul = {
        plusminus_df %>%
          add_row() %>%
          map(unique) %>%
          expand.grid()
      }, 
      control = list(warmup = 5)
    )
    library(ggplot2)
    autoplot(bm) + ggtitle('comparison between cross-join')
    

    Timing function

    time_function <- function(n = 5, j = 2){
      # Build a data.frame with n columns, each holding j unique character values
      idx <- seq_len(n)
      df <- do.call(CJ, args = c(lapply(idx, function(x) as.character(seq_len(j))),
                                 sorted = FALSE,
                                 unique = TRUE))
      names(df) <- as.character(idx)
      microbenchmark(
        nullgrid.expand = nullgrid.expand(df),
        nullgrid.expand.dt = nullgrid.expand.dt(df),
        ekoam = unique(bind_rows(apply(
          df, 1L, 
          function(r) head(expand.grid(lapply(r, c, NA_character_), stringsAsFactors = FALSE), -1L)
        ))),
        paul = {
          df %>%
            add_row() %>%
            map(unique) %>%
            expand.grid()
        }, 
        times = 10,
        control = list(warmup = 5)
      )
    }
    res <- lapply(seq(3, 10), time_function)
    for(i in seq_along(res)){
      res[[i]]$n <- seq(3, 10)[i]
    }
    ggplot(rbindlist(res), aes(x = n, y = log(time / 10^6, base = 10), col = expr)) +  # ns -> ms
      geom_smooth(se = FALSE) + 
      ggtitle('time-comparison given number of columns') + 
      labs(y = 'log(ms)', x = 'n columns')
    ggsave('so_2.png')
    
    res <- lapply(c(seq(2, 10, 2)), time_function, n = 5)
    for(i in seq_along(res)){
      res[[i]]$n <- seq(2, 10, 2)[i]
    }
    ggplot(rbindlist(res), aes(x = n, y = log(time / 10^6, base = 10), col = expr)) +  # ns -> ms
      geom_smooth(se = FALSE) + 
      ggtitle('time-comparison given number of unique values') + 
      labs(y = 'log(ms)', x = 'n unique values per column')
    ggsave('so_3.png')