Search code examples
rfor-loopif-statementdplyrmutate

Determine subgroups with mutate, a for loop and ifelse


I have a middle sized dataset (> 20 million observations) and I want to use a shiny app to filter the desired information about specific subgroups in the dataset.

Right now, I want to assign for each observation a subgroup number (1 = subgroup1, 2 = subgroup2 and so forth). However, before the desired subgroups are determined and submitted via the shiny app interface, the total number of subgroups is unknown.

Therefore, I create a global variable that stores the number of subgroups after the information is submitted. With that information I would like to combine a for loop and the ifelse function to create a column in the dataset that stores the respective subgroup for each observation.

Imagine the case with the following sample data: Overall, the dataset contains three columns. The first column contains the unique id of each observation. The second column is titled "insurance" and contains a name of insurance companies. The third column is titled "response_code" and contains numbers between 1 and 5 (as a string). This column tells us how the observation rated its satisfaction with the insurance company.

As an example, the dataset might look like this:

ID insurance response_code
1 Insurance_Company_1 "1"
2 Insurance_Company_2 "2"
3 Insurance_Company_4 "3"
4 Insurance_Company_3 "1"
5 Insurance_Company_6 "2"
6 Insurance_Company_2 "4"
7 Insurance_Company_1 "5"
8 Insurance_Company_4 "1"

Code:

df <- data.frame(ID = c(1,2,3,4,5,6,7,8), 
                 insurance = c("Insurance_Company_1", "Insurance_Company_2", 
                               "Insurance_Company_4", "Insurance_Company_3",
                               "Insurance_Company_6", "Insurance_Company_2",
                               "Insurance_Company_1", "Insurance_Company_4"),
                 response_code = c("1", "2", "3", "1", "2", "4", "5", "1"))

Besides, I have a few variables with characters that contain the name of certain insurance companies and certain response codes. These variables look like this:

labels_subgroup_1 <- c("Insurance_Company_1", "Insurance_Company_2")

labels_subgroup_2 <- c("Insurance_Company_3")

labels_subgroup_3 <- c("Insurance_Company_4", "Insurance_Company_5", "Insurance_Company_6").

response_codes_subgroup_1 <- c("1")

response_codes_subgroup_2 <- c("1", "2")

response_codes_subgroup_3 <- c("2", "3")

Last but not least, I have a variable that stores the total number of the desired subgroups:

number_of_subgroups <- 3 

(This number is determined as soon as the data of the shiny app interface are submitted)

As a final result, I want to test for each observation: If the name of the insurance company is contained in one of the aforementioned labels_subgroup variables (e.g. Insurance_Company_2) AND the response_code is contained in the respective reponse_codes_subgroup variable (e.g. in the case of Insurance_Company_2 -> "2" or "3"), THEN the respective subgroup number is assigned to the observation. Otherwise, the subgroup number remains zero.

In the example, the desired result should look like this:

ID insurance response_code subgroup_number
1 Insurance_Company_1 "1" 1
2 Insurance_Company_2 "2" 0
3 Insurance_Company_4 "3" 3
4 Insurance_Company_3 "1" 2
5 Insurance_Company_6 "2" 3
6 Insurance_Company_2 "4" 0
7 Insurance_Company_1 "5" 0
8 Insurance_Company_4 "1" 0

I already found a solution with two nested for-loops, however this approach is way too slow for this amount of data:

df$subgroup_number <- 0
for(i in 1:nrow(df)) {
  for(j in 1:number_of_subgroups) {
    if(df$subgroup_number[i] == 0 && 
       df$insurance[i] %in% eval(parse(text = paste0("labels_subgroup_", j))) && 
       df$response_code[i] %in% eval(parse(text = paste0("response_codes_subgroup_", j)))) {
       df$subgroup_number[i] <- j
    }
  }
}

Consequently, I tried to implement an approach with mutate and ifelse() within a for loop. Unfortunately, only the last iteration of the loop is saved in the data:

df$subgroup_number <- 0
for(i in 1:number_of_subgroups) {
  df <- df %>% 
    mutate(subgroup_number = ifelse(
    subgroup_number == 0 &
    insurance %in% eval(parse(text = paste0("labels_subgroup_", i))) &
    response_code %in% eval(parse(text = paste0("response_codes_subgroup_", i))), 
    i, 0))
}

Result:

ID insurance response_code subgroup_number
1 Insurance_Company_1 "1" 0
2 Insurance_Company_2 "2" 0
3 Insurance_Company_4 "3" 3
4 Insurance_Company_3 "1" 0
5 Insurance_Company_6 "2" 3
6 Insurance_Company_2 "4" 0
7 Insurance_Company_1 "5" 0
8 Insurance_Company_4 "1" 0

Is there a solution which saves the results of all iterations in the dataframe? Or is there a possible approach with the case_when function that might work within a for-loop? Thanks in advanace for your help.


Solution

  • For everyone who is still interested in the solution, I used the following approach:

    df$subgroup_number <- 0
    subgroup_number_list <- vector(mode = "list", length = number_of_subgroups)
    for(i in 1:number_of_subgroups) {
      subgroup_number_list[[i]] = ifelse(
      df$insurance %in% eval(parse(text = paste0("labels_subgroup_", i))) &
      df$response_code %in% eval(parse(text = paste0("response_codes_subgroup_", i))), 
      i, 0)
    }
    
    subgroup_number_df <- mapply(c, subgroup_number_list)
    
    subgroup_number_vec <- rowSums(subgroup_number_df )
    
    df$subgroup_number <- subgroup_number_vec
    

    There might be a better or cleaner solution, but I am more than satisfied with the performance of my approach.