
Outlier removal via Cook's distance and lm models


I am trying to write code that removes outliers from a linear model, and I want to stay flexible about which formula I use for this. But it does not work.

require(caret)
random_samples <- createDataPartition(iris$Sepal.Length, times=10, p=0.8)

getTrainTest <- function(Index, data){
  train_data <- data[Index, ] # training data = rows selected by Index (partition k)
  test_data <- data[-Index, ] # test data = original data frame minus the training rows
  return(list("train"=train_data, "test"=test_data))
}

datasets <- lapply(random_samples, getTrainTest, iris)

forumla1 <- as.formula(Sepal.Length ~ Petal.Length)

compute_cooks_models <- function(x,eq){
  cooks.distance(lm(eq, 
                    data = x, na.action = na.exclude))}

result <- Map (compute_cooks_models,datasets, eq=forumla1)

Error: object of type 'symbol' is not subsettable

I don't get what I am doing wrong??

Could someone help me out? Nadine


Solution

  • There are a couple of issues in your code.

    • datasets is a list of lists of data frames, so when you loop over it with Map you loop over the first level and pass a whole list (with the elements train and test) to compute_cooks_models. If you'd like to fit the lm model on the training set, you have to use x$train for the data argument.

    • The second issue is the use of Map: this function expects a vector or a list of values for each argument and calls the function on the first elements, then on the second elements, and so on. For example:

    my_fun <- function(x, y){
      paste0(x, y)
    }
    
    Map(my_fun, letters[1:5], 1:5)
    
    ## Output:
    # $a
    # [1] "a1"
    # 
    # $b
    # [1] "b2"
    # 
    # $c
    # [1] "c3"
    # 
    # $d
    # [1] "d4"
    # 
    # $e
    # [1] "e5"
    

    This means that in your case Map tries to take the first element of datasets together with the first element of forumla1. A formula is stored as a call with three parts (the ~ operator, the left-hand side and the right-hand side), so lm receives a single symbol instead of the whole formula, which causes the error. You could instead use sapply, which passes eq unchanged to every call and should do what you need, like so:

    forumla1 <- as.formula(Sepal.Length ~ Petal.Length)
    
    compute_cooks_models <- function(x, eq) {
      cooks.distance(lm(eq, data = x$train, na.action = na.exclude))
    }
    
    result <- sapply(datasets, compute_cooks_models, eq=forumla1)
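
If you'd rather keep Map, a possible alternative is to pass the formula through MoreArgs, which is forwarded to mapply and hands the same object to every call. The sketch below is only an illustration under a few assumptions: it replaces the caret partitions with a base-R sample() stand-in so it runs without extra packages, and the drop_outliers helper with its 4 / n cutoff is a hypothetical rule of thumb that is not part of your question.

```r
# Base-R stand-in for the caret partitions (assumption): 3 random 80% index sets
set.seed(1)
random_samples <- replicate(3, sample(nrow(iris), 120), simplify = FALSE)
datasets <- lapply(random_samples, function(idx) {
  list(train = iris[idx, ], test = iris[-idx, ])
})

forumla1 <- Sepal.Length ~ Petal.Length

# A formula is a call with three parts, which is what Map would iterate over
n_parts <- length(forumla1)

compute_cooks_models <- function(x, eq) {
  cooks.distance(lm(eq, data = x$train, na.action = na.exclude))
}

# MoreArgs gives the whole formula to every call instead of splitting it up
result <- Map(compute_cooks_models, datasets, MoreArgs = list(eq = forumla1))

# Hypothetical helper: keep training rows whose Cook's distance is at most 4 / n
drop_outliers <- function(x, d) {
  x$train[d <= 4 / nrow(x$train), ]
}
cleaned <- Map(drop_outliers, datasets, result)
```

Here result stays a list with one vector of Cook's distances per partition, and cleaned holds the matching training sets with the flagged rows removed. Pick whatever cutoff suits your data; 4 / n is just one common convention.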