Search code examples
rgroup-byplyr

Summarize Simultaneous group_by of 2 separate dataframes with common numerical field


I have a dataframe A consisting of 'name', 'measure'. I have another dataframe B consisting of 'type', 'measure'. I want to perform a summarise operation using each 'name' group of A on each 'type' group of B using the 'measure' field. For example:

A has:

Name | Measure
George   5
George   6
Tyrone   7
Tyrone   3

B has:

Type | Measure
cold      3
cold      2
hot       1
hot       5

I want to do a summarise on George and cold, George and hot, Tyrone and cold, Tyrone and hot, and in each summarise I find the minimum absolute difference (George on cold would be min(abs(5-3, 5-2, 6-3, 6-2)) = 2, and then find the 'Type' for each 'name' that has the lowest such score. How exactly do I do this for large datasets with many groups?


Solution

  • There's probably a simpler way, but you could do:

    library(tidyverse)
    
    crossing(
      distinct(A, Name),
      distinct(B, Type)
    ) %>%
      left_join(A, by = 'Name') %>%
      left_join(B, by = 'Type') %>%
      group_by(Name, Type) %>%
      summarise(minAbsDiff = min(abs(Measure.x - Measure.y))) %>%
      group_by(Name) %>%
      slice(which.min(minAbsDiff))
    

    Output:

    # A tibble: 2 x 3
    # Groups:   Name [2]
      Name   Type  minAbsDiff
      <fct>  <fct>      <int>
    1 George hot            0
    2 Tyrone cold           0