
Conditional execution in R based on decision tree


I have a CSV file with predictor variables like blood pressure (BP), heart rate (HR), weight, body surface area (BSA), body mass index (BMI), age, and gender.

A decision-tree-based algorithm uses these variables to classify each patient as high risk (yes/no). HIGH_RISK is the last column in the CSV, and it is currently empty. I can apply the algorithm to individual subjects (individual rows in the CSV file) to populate the HIGH_RISK column, but there are so many rows that doing this manually would be impractical.

If it were simple addition, subtraction, multiplication, etc., I could have done it in R or even in Excel. But since the algorithm involves a forking decision tree, I am not sure how to do it. I am sure it is possible, since R is so powerful. Any suggestions?

The decision tree is similar to this: http://www.scielo.br/img/revistas/sa/v70n6/a01fig04.jpg


Solution

  • You could use this helper function I wrote for you:

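    decisionTree <- function(dataframe, lst) {
      # A leaf (a plain value rather than a list) is returned as the result.
      if (!is.recursive(lst)) return(lst)
      values <- numeric(nrow(dataframe))
      # Evaluate the first name of the list as a condition on the data frame columns.
      indices <- eval(parse(text = names(lst)[1]), dataframe)
      # Rows meeting the condition follow the first branch, all others the second.
      values[indices] <- decisionTree(dataframe[indices, ], lst[[1]])
      values[!indices] <- decisionTree(dataframe[!indices, ], lst[[2]])
      values
    }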
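
To make the key step of the helper concrete, here is a minimal sketch of how a condition string is turned into a row selector. It uses the built-in iris data set; the variable names cond and hits are only illustrative.

```r
# A condition written as a string, as it would appear as a list name in the tree.
cond <- "Sepal.Length > 5"

# parse() turns the string into an R expression; eval() evaluates it with the
# data frame's columns in scope, giving one TRUE/FALSE per row.
hits <- eval(parse(text = cond), iris)

length(hits)  # 150, one entry per row of iris
head(hits)    # TRUE FALSE FALSE FALSE FALSE TRUE
```

Note that the helper only ever evaluates the first name in each list. The second element receives every row for which that condition is FALSE, whatever it is named.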
    

    The general format is to pass a data.frame as the first argument and a nested list representing the decision tree as the second argument, in a format like this:

    list("first_variable > 0.3" =
           list("second_variable > 0.5" = 1,
                "second_variable <= 0.5" = list(
                  "third_variable > 0.3" = 0,
                  1) # naming the negated condition is optional
           ),
         "first_variable <= 0.3" = 0)
    

    Example

    iris$foo <- decisionTree(iris, list("Sepal.Length > 5" = list("Petal.Length > 1.3" = 1, 0), 0))
    head(iris) # All entries with Sepal.Length > 5 and Petal.Length > 1.3 will contain a 1.
    #      Sepal.Length Sepal.Width Petal.Length Petal.Width Species foo
    #    1          5.1         3.5          1.4         0.2  setosa   1
    #    2          4.9         3.0          1.4         0.2  setosa   0
    #    3          4.7         3.2          1.3         0.2  setosa   0
    #    4          4.6         3.1          1.5         0.2  setosa   0
    #    5          5.0         3.6          1.4         0.2  setosa   0
    #    6          5.4         3.9          1.7         0.4  setosa   1
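
As a sanity check, the same two-level tree can also be written with nested ifelse() calls in base R. This is only a sketch of an alternative that may be convenient for small trees (foo2 is an illustrative name), not a replacement for the helper:

```r
# Vectorised equivalent of the example tree:
# Sepal.Length > 5 and Petal.Length > 1.3 gives 1, everything else gives 0.
foo2 <- ifelse(iris$Sepal.Length > 5,
               ifelse(iris$Petal.Length > 1.3, 1, 0),
               0)

head(foo2)  # 1 0 0 0 0 1
```

The nesting mirrors the tree directly, but it becomes hard to read once the tree has more than a few levels, which is where the list-based helper pays off.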
    

    For the graph you provided, the second argument would look like:

    list("Ts_Armpit > 35.1" = 1,
      list("Ts_Breast <= 0.39" = list("Ts_Croup <= 28.9" = 1, 0),
        list("Ts_Groin <= 35.1" = 1, list("Ts_Armpit <= 33.7" = 1, 0))))
    

    where 1 indicates Discomfort and 0 indicates Comfort.
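
Applied to the patient data from the question, the workflow might look like the sketch below. Everything specific here is an assumption: the column names BP, HR and BMI, the file name, and above all the thresholds in risk_tree, which are made up for illustration and are not a clinical rule. A small in-memory data frame stands in for the CSV so the block runs on its own.

```r
# Helper from the answer above, repeated so this block is self-contained.
decisionTree <- function(dataframe, lst) {
  if (!is.recursive(lst)) return(lst)
  values <- numeric(nrow(dataframe))
  indices <- eval(parse(text = names(lst)[1]), dataframe)
  values[indices] <- decisionTree(dataframe[indices, ], lst[[1]])
  values[!indices] <- decisionTree(dataframe[!indices, ], lst[[2]])
  values
}

# Stand-in for: patients <- read.csv("patients.csv")  (hypothetical file name)
patients <- data.frame(
  BP  = c(150, 120, 145, 110),
  HR  = c(95, 70, 80, 100),
  BMI = c(31, 22, 24, 35)
)

# Hypothetical tree; the thresholds are invented for this example.
risk_tree <- list(
  "BP > 140" = list("HR > 90" = 1,
                    list("BMI > 30" = 1, 0)),
  list("BMI > 30" = 1, 0)
)

# Fill the HIGH_RISK column for every row in one call.
patients$HIGH_RISK <- decisionTree(patients, risk_tree)
patients$HIGH_RISK  # 1 0 0 1

# Write the result back out (to a temporary file in this sketch).
out <- tempfile(fileext = ".csv")
write.csv(patients, out, row.names = FALSE)
```

Rows with missing values in a tested column produce NA conditions, which the helper as written does not handle, so filtering or imputing such rows first is one option.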