Search code examples
rjoinrow-number

How to join data frames in R without duplicating original data values


I have 2 data frames (DF1 & DF2) and 1 would like to join them together by a unique value called "acc_num". In DF2, payment was made twice by acc_num A and thrice by B. Data frames are as follows.

DF1:

 acc_num     total_use    sales
    A           433        145
    A            NA          2
    A            NA         18
    B           149         32

DF2:

   acc       payment   
    A           150
    A            98
    B            44
    B            15
    B            10

My desired output is:

 acc_num    total_use    sales    payment
    A            433       145       150
    A             NA         2        98
    A             NA        18        NA
    B            149        32        44
    B             NA        NA        15
    B             NA        NA        10

I've tried full_join and merge but the output was not as desired. I couldn't work this out as I'm still a beginner in R, and haven't found the solution to this.

Example of the code I used was

test_full_join <- DF1 %>% full_join(DF2, by = c("acc_num" = "acc"))

The displayed output was:

 acc_num       total_use   sales     payment
    A             433       145        150
    A             433       145         98
    A              NA         2        150
    A              NA         2         98
    A              NA        18        150
    A              NA        18         98
    B             149        32         44
    B             149        32         15
    B             149        32         10

This is contrary to my desired output as at the end, my concern is to get the total sum of total_use, sales and payment. This output will definitely give me wrong interpretation for data visualization later on.


Solution

  • We may need to do a join by row_number() based on 'acc_num'

    library(dplyr)
    df1 %>%
         group_by(acc_num) %>%
         mutate(grpind = row_number()) %>% 
         full_join(df2 %>% 
                       group_by(acc_num = acc) %>%
                       mutate(grpind = row_number())) %>%
         select(acc_num, total_use, sales, payment)
    # A tibble: 6 x 4
    # Groups:   acc_num [2]
    #  acc_num total_use sales payment
    #  <chr>       <int> <int>   <int>
    #1 A             433   145     150
    #2 A              NA     2      98
    #3 A              NA    18      NA
    #4 B             149    32      44
    #5 B              NA    NA      15
    #6 B              NA    NA      10
    

    data

    df1 <- structure(list(acc_num = c("A", "A", "A", "B"), total_use = c(433L, 
    NA, NA, 149L), sales = c(145L, 2L, 18L, 32L)), class = "data.frame",
       row.names = c(NA, 
    -4L))
    
    df2 <- structure(list(acc = c("A", "A", "B", "B", "B"), payment = c(150L, 
    98L, 44L, 15L, 10L)), class = "data.frame", row.names = c(NA, 
    -5L))