Search code examples
rlapplysapplymapply

R multiply unequal dataframes based on specific column values


Updated question to include extra columns that pass from df1 to output, as well as add that the main df has 7 million rows.


I have two dataframes, similar to Pandas SO Question

And I need to multiply them where the names match. I'm not sure how to accomplish this cleanly. is there an apply function that does this easily?

DF1: (there are 16 columns of extra data, 7 million rows long)

Data1   Data2   Name      Value
aa      bb      sample1   50
ff      ff      sample1   100
ef      fd      sample1   75
ff      df      sample2   100
bbf     ad3     sample2   200
dd      a       sample2   300
33      3rf     sample3   25
ddd     dd      sample3   50
dd      dd      sample3   40

DF2:

Name      Value
sample1   1
sample2   0.5
sample3   2

OUTPUT: (With the additional 16 columns that aren't shown)

Data1   Data2   Name      Value
aa      bb      sample1   50
ff      ff      sample1   100
ef      fd      sample1   75
ff      df      sample2   50
bbf     ad3     sample2   100
dd      a       sample2   150
33      3rf     sample3   50
ddd     dd      sample3   100
dd      dd      sample3   80

Solution

  • We can join both the dataframes using a left_join or an inner_join and then multiply the corresponding Value columns with one another. Using dplyr that can be done as :

    library(dplyr)
    
    inner_join(df1, df2, by = 'Name') %>%
       mutate(Value = Value.x * Value.y) %>%
       select(names(df1))
    
    #     Name Value
    #1 sample1    50
    #2 sample1   100
    #3 sample1    75
    #4 sample2    50
    #5 sample2   100
    #6 sample2   150
    #7 sample3    50
    #8 sample3   100
    #9 sample3    80
    

    data

    df1 <- structure(list(Name = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 
    3L, 3L), .Label = c("sample1", "sample2", "sample3"), class = "factor"), 
    Value = c(50L, 100L, 75L, 100L, 200L, 300L, 25L, 50L, 40L
    )), class = "data.frame", row.names = c(NA, -9L))
    
    df2 <- structure(list(Name = structure(1:3, .Label = c("sample1", "sample2", 
    "sample3"), class = "factor"), Value = c(1, 0.5, 2)), class = "data.frame", 
    row.names = c(NA, -3L))