Search code examples
rdplyrplyrtidyr

R: Create a new column in a dataframe, using column name, condition and value from another dataframe


Consider a base data frame as:

data <-  data.frame(amount_bin = c("10K-25K", "25K-35K", "35K-45K", "45K-50K", "50K+", "10K-25K", "25K-35K", "35K-45K", "45K-50K", "50K+", "10K-25K", "25K-35K", "35K-45K", "45K-50K", "50K+"),
                   risk_score = c("0-700", "700-750", "750-800", "800-850", "850-900", "0-700", "700-750", "750-800", "800-850", "850-900", "0-700", "700-750", "750-800", "800-850", "850-900"))

and grouping information in another data frame as:

group_info <- data.frame(variable = c("amount_bin_group", "amount_bin_group", "amount_bin_group", "amount_bin_group", "amount_bin_group",
                                 "risk_score_group", "risk_score_group", "risk_score_group", "risk_score_group", "risk_score_group"),
                    bin = c("10K-25K", "25K-35K", "35K-45K", "45K-50K", "50K+",
                            "0-700", "700-750", "750-800", "800-850", "850-900"),
                    group = c("1", "1", "2", "2", "3",
                              "a", "a", "a", "b", "b"))

I want to make 2 columns in base data frame (data) called "amount_bin_group" and "risk_score_group", which takes values from the column group_info$group when bin columns from group_info and data are the same. For simplicity, let's assume that the base column will always be the group_info$variable name minus the "group" string. That implies, when we want to create column amount_bin_group, base column will always be amount_bin in the base data frame.

The expected result data frame is:

final_data <-  data.frame(amount_bin = c("10K-25K", "25K-35K", "35K-45K", "45K-50K", "50K+", "10K-25K", "25K-35K", "35K-45K", "45K-50K", "50K+", "10K-25K", "25K-35K", "35K-45K", "45K-50K", "50K+"),
                   risk_score = c("0-700", "700-750", "750-800", "800-850", "850-900", "0-700", "700-750", "750-800", "800-850", "850-900", "0-700", "700-750", "750-800", "800-850", "850-900"),
                   amount_bin_group = c("1", "1", "2", "2", "3", "1", "1", "2", "2", "3", "1", "1", "2", "2", "3"),
                   risk_score_group = c("a", "a", "a", "b", "b", "a", "a", "a", "b", "b", "a", "a", "a", "b", "b"))

A solution that I just thought is iteratively merge the data frames i.e. :

final_data <- merge(data, group_info[, c("bin", "group")], by.x = "amount_bin", by.y = "bin")

final_data$amount_bin_group <- final_data$group
final_data$group <- NULL

But, I am sure there can be a more efficient solution. Please note that there are multiple such columns and not just two. So, maybe a loop would help.


Solution

  • You could just use a for loop to keep merging on the different sets:

    for (i in unique(group_info$variable)) {
      data <- merge(
        data, group_info[group_info$variable==i,c("bin","group")],
        by.x=sub("_group","",i), by.y="bin"
      )
      names(data)[names(data)=="group"] <- i
    }