R: Create Indicator Columns from list of conditions


I have a dataframe and a number of conditions. Each condition is supposed to check whether the value in a certain column of the dataframe is within a set of valid values.

This is what I tried:

# create the sample dataframe
age <- c(120, 45)
sex <- c("x", "f")

df <- data.frame(age, sex)

# create the sample conditions
conditions <- list(
  list("age", c(18:100)),
  list("sex", c("f", "m"))
)

addIndicator <- function (df, columnName, validValues) {
  indicator <- vector()

  for (row in df[, toString(columnName)]) {
    # for some strange reason, %in% doesn't work correctly here, but always returns FALSE
    indicator <- append(indicator, row %in% validValues)
  }
  df <- cbind(df, indicator)

  # rename the column
  names(df)[length(names(df))] <- paste0("I_", columnName)

  return(df)
}

for (condition in conditions){
  columnName <- condition[1]
  validValues <- condition[2]
  df <- addIndicator(df, columnName, validValues)
}

print(df)

However, every condition is reported as not met, which is not what I expect:

  age sex I_age I_sex
1 120   x FALSE FALSE
2  45   f FALSE FALSE

I figured that %in% does not return the expected result. I checked typeof(row) and tried to reduce the problem to a minimal example. In a simple minimal example with the same types and values, %in% works properly, so something must be wrong in the context where I apply it. Since this is my first attempt at writing anything in R, I am stuck here.

What am I doing wrong and how can I achieve what I want?


Solution
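
On the "what am I doing wrong" part: a likely cause is the single-bracket indexing in the loop. condition[1] and condition[2] return lists of length one rather than the elements themselves, so %in% ends up comparing each value against a list that holds the whole vector as a single element. The following is a minimal sketch of that difference, using only base R; the variable names wrapped, unwrapped, in_wrapped and in_unwrapped are made up for illustration.

```r
# One condition, shaped like the entries in the question's `conditions` list
condition <- list("age", 18:100)

# Single brackets keep the list wrapper: a list of length 1
wrapped <- condition[2]
# Double brackets extract the element itself: an integer vector
unwrapped <- condition[[2]]

print(class(wrapped))
print(class(unwrapped))

# Against the wrapped list, 45 is compared with one element (the whole vector)
in_wrapped <- 45 %in% wrapped
# Against the extracted vector, 45 is compared with each valid value
in_unwrapped <- 45 %in% unwrapped

print(in_wrapped)
print(in_unwrapped)
```

Under that reading, changing the loop to columnName <- condition[[1]] and validValues <- condition[[2]] should be enough to make the original function behave as intended.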

  • If you prefer an approach that uses the tidyverse family of packages:

    library(tidyverse)
    
    allowed_values <- list(age = 18:100, sex = c("f", "m"))
    
    df %>%
      imap_dfr(~ .x %in% allowed_values[[.y]]) %>%
      rename_with(~ paste0('I_', .x)) %>%
      bind_cols(df)
    

    imap_dfr applies a lambda function to each column in df. Inside the lambda, .x refers to the column contents and .y to the column name.

    rename_with renames the columns using another lambda function and bind_cols combines the results with the original dataframe.

    I borrowed the simplified list of conditions from ben's answer. I find my approach slightly more readable but that is a matter of taste and of whether you are already using the tidyverse elsewhere.
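
For comparison, the same per-column check can be sketched in base R without any packages. This is only one possible way to write it, not necessarily how the other answers do it; it reuses the named allowed_values list from above, and the names indicators and result are made up for illustration. Because %in% is vectorised, no explicit loop over rows is needed.

```r
# Sample data from the question
df <- data.frame(age = c(120, 45), sex = c("x", "f"))

# Named list: each name is a column of df, each value the set of valid entries
allowed_values <- list(age = 18:100, sex = c("f", "m"))

# For each named entry, test the whole matching column against its valid values
indicators <- Map(
  function(column, valid) df[[column]] %in% valid,
  names(allowed_values),
  allowed_values
)

# Prefix the indicator columns so they do not clash with the originals
names(indicators) <- paste0("I_", names(allowed_values))

# Append the indicator columns to the original data frame
result <- cbind(df, as.data.frame(indicators))
print(result)
```

With the sample data, the first row (age 120, sex "x") should fail both checks and the second row (age 45, sex "f") should pass both.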