Search code examples
rdataframelookup-tables

How to calculate a new column based on other columns using a lookup approach in R?


I am trying to calculate another column in a dataframe based on another columns and a lookup table. I have a simple example that only shows few data (my real dataset contains millions of rows).

I have the following datasets:

  lookup<- data.frame("class"=c(1, 2, 1, 2), "type"=c("A", "B", "B", "A"), 
           "condition1"=c(50, 60, 55, 53), "condition2"=c(80, 85, 86, 83))

  lookup
  class type condition1 condition2
      1    A         50         80
      2    B         60         85
      1    B         55         86
      2    A         53         83

My dataframe is of this shape:

  data<- data.frame("class"=c(1, 2, 2, 1, 2, 1), 
         "type"=c("A","B", "A", "A", "B", "B"), 
         "percentage_condition1"=c(0.3, 0.6, 0.1, 0.2, 0.4, 0.5), 
         "percentage_condition2"=c(0.7, 0.4, 0.9, 0.8, 0.6, 0.5))


  data
  class type percentage_condition1 percentage_condition2
    1    A                   0.3                   0.7
    2    B                   0.6                   0.4
    2    A                   0.1                   0.9
    1    A                   0.2                   0.8
    2    B                   0.4                   0.6
    1    B                   0.5                   0.5

I would like to create a new column in my dataframe named data that will use the lookup table such as :

in my data where my class matches my type columns, it can calculate a new column in my dataframe data such as (not real code):

d$new<- lookup$condition1 * data$percentage_condition1 + lookup$condition2 * data$percentage_condition2

I know how to do it with a if else statement but I am trying to do it more efficiently as I am working with a lot of data. I know to do it with one column in the lookup table but I do not succeed with several columns (class and type column).

Thanks for any help and suggestions!


Solution

  • We can use match to get the index of 'type' columns for 'data' and 'type', use that index to get the corresponding rows of 'condition1', 'condition2' columns, multiply with the percentage columns of 'data' and get the rowSums

    data$new <- rowSums(lookup[match(paste(data$class, data$type), 
                      paste(lookup$class, lookup$type)), 
                   c("condition1", "condition2")] * data[3:4])
    
    data
    #  class type percentage_condition1 percentage_condition2  new
    #1     1    A                   0.3                   0.7 71.0
    #2     2    B                   0.6                   0.4 70.0
    #3     2    A                   0.1                   0.9 80.0
    #4     1    A                   0.2                   0.8 74.0
    #5     2    B                   0.4                   0.6 75.0
    #6     1    B                   0.5                   0.5 70.5
    

    NOTE: With match, we can do it much easier


    Or using data.table

    library(data.table)
    setDT(data)[lookup, new := condition1 * percentage_condition1 + 
           condition2 * percentage_condition2, on = .(class, type)]
    data
    #   class type percentage_condition1 percentage_condition2  new
    #1:     1    A                   0.3                   0.7 71.0
    #2:     2    B                   0.6                   0.4 70.0
    #3:     2    A                   0.1                   0.9 80.0
    #4:     1    A                   0.2                   0.8 74.0
    #5:     2    B                   0.4                   0.6 75.0
    #6:     1    B                   0.5                   0.5 70.5
    

    Or using tidyverse

    library(tidyverse)
    data %>% 
         left_join(lookup, by = c("class", "type")) %>%
         mutate(new = condition1 * percentage_condition1 + 
           condition2 * percentage_condition2) %>%
         select(names(data), new)
    #   class type percentage_condition1 percentage_condition2  new
    #1     1    A                   0.3                   0.7 71.0
    #2     2    B                   0.6                   0.4 70.0
    #3     2    A                   0.1                   0.9 80.0
    #4     1    A                   0.2                   0.8 74.0
    #5     2    B                   0.4                   0.6 75.0
    #6     1    B                   0.5                   0.5 70.5
    

    Or use a SQL based solution with sqldf

    library(sqldf)
    str1 <- "SELECT data.class, data.type, data.percentage_condition1, 
      data.percentage_condition2, (data.percentage_condition1 * lookup.condition1 + 
       data.percentage_condition2 * lookup.condition2) as new
       FROM data 
       LEFT JOIN lookup on data.class = lookup.class AND 
       data.type = lookup.type"
    sqldf(str1)
    

    Or as @G.Grothendieck mentioned in the comments, with alias identifiers, sqldf solution can be made more compact

    sqldf("select D.*, L.condition1 * D.[percentage_condition1] + 
           L.condition2 * D.[percentage_condition2] as new 
           from data as D 
           left join lookup as L 
           using(class, type)")
    

    NOTE: All the solutions maintains the original order of the dataset