Search code examples
rdataframedplyr

How to apply "outer" on data frames in Tidyverse?


I have a data frame that contains measurements for multiple groups. I want to apply outer to each pair of groups and compute a summary. Can it be done in Tidyverse style? Below is an example of how to do it with two loops. I have played around with expand_grid and group_by, but without success.

df <- iris
groups = as.character(levels(df$Species))
n <- length(groups)
out <- matrix(numeric(n*n), nrow = n, ncol = n)
colnames(out) <- groups
rownames(out) <- groups
for (sp1 in 1:n) {
  for (sp2 in 1:n) {
    out[sp1, sp2] <- sum(outer(
      unlist(subset(df, Species == groups[sp1], Sepal.Length)),
      unlist(subset(df, Species == groups[sp2], Sepal.Length)),
      ">"))
  }
}
out
           setosa versicolor virginica
setosa       1128        139        34
versicolor   2302       1169       479
virginica    2457       1927      1165

Solution

  • 1) Base R With these sorts of calculations base R is probably more appropriate. We can split Sepal.Length into a 3 element list s, one component per Species, and then use outer or a double sapply.

    s <- with(iris, split(Sepal.Length, Species))
    outer(s, s, Vectorize(\(x, y) sum(outer(y, x, "<"))))
    
    ##            setosa versicolor virginica
    ## setosa       1128        139        34
    ## versicolor   2302       1169       479
    ## virginica    2457       1927      1165
    
    # alternately replace the line with outer with this
    sapply(s, \(u) sapply(s, \(v) sum(outer(u, v, "<"))))
    

    2) magrittr We could make this into a magrittr pipeline like this:

    library(magrittr)
    iris %>%
      with(split(Sepal.Length, Species)) %>%
      outer(., ., Vectorize(\(x, y) sum(outer(y, x, "<"))))
    

    3) tidyverse or to go full tidyverse

    library(dplyr)
    library(tidyr)
    
    iris %>%
      select(Sepal.Length, Species) %>%
      expand_grid(a = ., b = .) %>%
      unnest(c(a, b), names_sep = ".") %>%
      summarize(value = sum(a.Sepal.Length < b.Sepal.Length), 
                .by = ends_with("Species")) %>%
      pivot_wider(id_cols = b.Species, names_from = a.Species) %>%
      rename(Species = b.Species)
    
    ##   # A tibble: 3 × 4
    ##   Species    setosa versicolor virginica
    ##   <fct>       <int>      <int>     <int>
    ## 1 setosa       1128        139        34
    ## 2 versicolor   2302       1169       479
    ## 3 virginica    2457       1927      1165
    

    4) tapply Another base R approach is via tapply like this:

    with(iris, tapply(kronecker(Sepal.Length, Sepal.Length, "<"), 
      expand.grid(Species = Species, Species = Species), sum))
    ##             Species
    ## Species      setosa versicolor virginica
    ##   setosa       1128        139        34
    ##   versicolor   2302       1169       479
    ##   virginica    2457       1927      1165