Search code examples
pythonrpandastidyverse

Is there anyway to merge two datasets that with two similar key column but not the same key column?


I have two datasets as follows:

a:

have lunch have dinner
set a set 1
set b set 2
... ...

b:

variable variable_value value
have lunch set a 0.2
have lunch set b 0.5
have dinner set 1 1.5
have dinner set 2 1.7
... ... ...

My expected outcome:

c:

have lunch have lunch value have dinner have dinner value
set a 0.2 set 1 1.5
set b 0.5 set 2 1.7
... ... ... ...

Is there anyway to merge these two datasets in python/r?

I can think of one solution in R:

library(tidyverse)
library(magrittr)
b %<>% pivot_wider(names_from = variable, values_from = value)

left_join(a,b)

Solution

  • You can melt, merge, then pivot:

    out = (a
      .melt(ignore_index=False, value_name='variable_value')
      .reset_index()
      .merge(b, how='left')
      .pivot(index='index', columns='variable')
    )
    

    Output:

             variable_value                  value           
    variable    have dinner have lunch have dinner have lunch
    index                                                    
    0                 set 1      set a         1.5        0.2
    1                 set 2      set b         1.7        0.5
    

    For a formatting closer to what you have, further reprocess the column names:

    out = (a.melt(ignore_index=False, value_name='variable_value')
      .reset_index()
      .merge(b, how='left')
      .pivot(index='index', columns='variable')
      .sort_index(level=1, axis=1)
      .rename_axis(index=None)
    )
    
    out.columns = out.columns.map(lambda x: f'{x[1]}{" value" if x[0] == "variable_value" else ""}')
    

    Output:

       have dinner have dinner value  have lunch have lunch value
    0          1.5             set 1         0.2            set a
    1          1.7             set 2         0.5            set b
    

    alternative

    Or, using a loop and concat:

    out = pd.concat([a[[col]].merge(b.loc[b['variable'].eq(col),
                                          ['variable_value', 'value']]
                                     .rename(columns={'variable_value': col,
                                                      'value': f'{col} value'
                                                     }),
                                   how='left'
                                   )
                     for col in a], axis=1)
    

    Output:

      have lunch  have lunch value have dinner  have dinner value
    0      set a               0.2       set 1                1.5
    1      set b               0.5       set 2                1.7