I have a middle sized dataset (> 20 million observations) and I want to use a shiny app to filter the desired information about specific subgroups in the dataset.
Right now, I want to assign for each observation a subgroup number (1 = subgroup1, 2 = subgroup2 and so forth). However, before the desired subgroups are determined and submitted via the shiny app interface, the total number of subgroups is unknown.
Therefore, I create a global variable that stores the number of subgroups after the information is submitted. With that information I would like to combine a for loop and the ifelse function to create a column in the dataset that stores the respective subgroup for each observation.
Imagine the case with the following sample data: Overall, the dataset contains three columns. The first column contains the unique id of each observation. The second column is titled "insurance" and contains a name of insurance companies. The third column is titled "response_code" and contains numbers between 1 and 5 (as a string). This column tells us how the observation rated its satisfaction with the insurance company.
As an example, the dataset might look like this:
ID | insurance | response_code |
---|---|---|
1 | Insurance_Company_1 | "1" |
2 | Insurance_Company_2 | "2" |
3 | Insurance_Company_4 | "3" |
4 | Insurance_Company_3 | "1" |
5 | Insurance_Company_6 | "2" |
6 | Insurance_Company_2 | "4" |
7 | Insurance_Company_1 | "5" |
8 | Insurance_Company_4 | "1" |
Code:
df <- data.frame(ID = c(1,2,3,4,5,6,7,8),
insurance = c("Insurance_Company_1", "Insurance_Company_2",
"Insurance_Company_4", "Insurance_Company_3",
"Insurance_Company_6", "Insurance_Company_2",
"Insurance_Company_1", "Insurance_Company_4"),
response_code = c("1", "2", "3", "1", "2", "4", "5", "1"))
Besides, I have a few variables with characters that contain the name of certain insurance companies and certain response codes. These variables look like this:
labels_subgroup_1 <- c("Insurance_Company_1", "Insurance_Company_2")
labels_subgroup_2 <- c("Insurance_Company_3")
labels_subgroup_3 <- c("Insurance_Company_4", "Insurance_Company_5", "Insurance_Company_6").
response_codes_subgroup_1 <- c("1")
response_codes_subgroup_2 <- c("1", "2")
response_codes_subgroup_3 <- c("2", "3")
Last but not least, I have a variable that stores the total number of the desired subgroups:
number_of_subgroups <- 3
(This number is determined as soon as the data of the shiny app interface are submitted)
As a final result, I want to test for each observation: If the name of the insurance company is contained in one of the aforementioned labels_subgroup variables (e.g. Insurance_Company_2) AND the response_code is contained in the respective reponse_codes_subgroup variable (e.g. in the case of Insurance_Company_2 -> "2" or "3"), THEN the respective subgroup number is assigned to the observation. Otherwise, the subgroup number remains zero.
In the example, the desired result should look like this:
ID | insurance | response_code | subgroup_number |
---|---|---|---|
1 | Insurance_Company_1 | "1" | 1 |
2 | Insurance_Company_2 | "2" | 0 |
3 | Insurance_Company_4 | "3" | 3 |
4 | Insurance_Company_3 | "1" | 2 |
5 | Insurance_Company_6 | "2" | 3 |
6 | Insurance_Company_2 | "4" | 0 |
7 | Insurance_Company_1 | "5" | 0 |
8 | Insurance_Company_4 | "1" | 0 |
I already found a solution with two nested for-loops, however this approach is way too slow for this amount of data:
df$subgroup_number <- 0
for(i in 1:nrow(df)) {
for(j in 1:number_of_subgroups) {
if(df$subgroup_number[i] == 0 &&
df$insurance[i] %in% eval(parse(text = paste0("labels_subgroup_", j))) &&
df$response_code[i] %in% eval(parse(text = paste0("response_codes_subgroup_", j)))) {
df$subgroup_number[i] <- j
}
}
}
Consequently, I tried to implement an approach with mutate and ifelse() within a for loop. Unfortunately, only the last iteration of the loop is saved in the data:
df$subgroup_number <- 0
for(i in 1:number_of_subgroups) {
df <- df %>%
mutate(subgroup_number = ifelse(
subgroup_number == 0 &
insurance %in% eval(parse(text = paste0("labels_subgroup_", i))) &
response_code %in% eval(parse(text = paste0("response_codes_subgroup_", i))),
i, 0))
}
Result:
ID | insurance | response_code | subgroup_number |
---|---|---|---|
1 | Insurance_Company_1 | "1" | 0 |
2 | Insurance_Company_2 | "2" | 0 |
3 | Insurance_Company_4 | "3" | 3 |
4 | Insurance_Company_3 | "1" | 0 |
5 | Insurance_Company_6 | "2" | 3 |
6 | Insurance_Company_2 | "4" | 0 |
7 | Insurance_Company_1 | "5" | 0 |
8 | Insurance_Company_4 | "1" | 0 |
Is there a solution which saves the results of all iterations in the dataframe? Or is there a possible approach with the case_when function that might work within a for-loop? Thanks in advanace for your help.
For everyone who is still interested in the solution, I used the following approach:
df$subgroup_number <- 0
subgroup_number_list <- vector(mode = "list", length = number_of_subgroups)
for(i in 1:number_of_subgroups) {
subgroup_number_list[[i]] = ifelse(
df$insurance %in% eval(parse(text = paste0("labels_subgroup_", i))) &
df$response_code %in% eval(parse(text = paste0("response_codes_subgroup_", i))),
i, 0)
}
subgroup_number_df <- mapply(c, subgroup_number_list)
subgroup_number_vec <- rowSums(subgroup_number_df )
df$subgroup_number <- subgroup_number_vec
There might be a better or cleaner solution, but I am more than satisfied with the performance of my approach.