Search code examples
rlistlapplypredict

Add a new column generated from predict() to a list of dataframes


I have a logistic regression model. I would like to predict the morphology of items in multiple dataframes that have been put into a list.

I have lots of dataframes (most say working with a list of dataframes is better).

I need help with 1:

  1. Applying the predict function to a list of dataframes.
  2. Adding these predictions to their corresponding dataframe inside the list.

I am not sure whether it is better to have the 1000 dataframes separately and predict using loops etc, or to continue having them inside a list.

Prior to this code I have split my data into train and test sets. I then trained the model using:

library(nnet)
#Training the multinomial model
multinom_model <- multinom(Morphology ~ ., data=morph, maxit=500)

#Checking the model
summary(multinom_model)

This was then followed by validation etc.

My new dataset, consisting of multiple dataframes stored in a list, called rose.list was formatted by the following:

filesrose <- list.files(pattern = "_rose.csv")

#Rename all files of rose dataset 'rose.i'
for (i in seq_along(filesrose)) {
  assign(paste("rose", i, sep = "."), read.csv(filesrose[i]))
}

#Make a list of the dataframes
rose.list <- lapply(ls(pattern="rose."), function(x) get(x))

I have been using this function to predict on a singular new dataframe

# Predicting the classification for individual datasets
rose.1$Morph <- predict(multinom_model, newdata=rose.1, "class")

Which gives me the dataframe, with the new prediction column 'Morph'

But how would I do this for multiple dataframes in my rose.list? I have tried:

lapply(rose.list, predict(multinom_model, "class"))

Error in eval(predvars, data, env) : object 'Area' not found

and, but also has the error:

lapply(rose.list, predict(multinom_model, newdata = rose.list, "class"))

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  arguments imply differing number of rows:

Solution

  • You can use an anonymous function (those with function(x) or abbreviated \(x)).

    library(nnet)
    multinom_model <- multinom(low ~ ., birthwt)
    lapply(df_list, \(x) predict(multinom_model, newdata=x, type='class'))
    
    # $rose_1
    # [1] 1 0 1 1 0 0 0 1 0 1 1 1 0 0 1 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 1 1 1 0 0 1 0 1 0
    # [40] 1 0 0 0 0 0 1 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 1 1 1 1 0 0 1
    # [79] 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 0
    # [118] 1 0 0 1 1 0 1 0 0 0 1 1 0 1 1 1 0 1 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1
    # [157] 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 1 0 0 0 0 1 0 1 1 1 1 0 0 1
    # Levels: 0 1
    # 
    # $rose_2
    # [1] 0 1 0 1 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 1 1 0 1 1 1 1 0 0 1 0 0 1 0 1 1 0 1
    # [40] 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 1 1 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1
    # [79] 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0
    # [118] 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0
    # [157] 0 0 0 1 1 1 1 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0
    # Levels: 0 1
    # 
    # $rose_3
    # [1] 0 0 0 0 1 1 0 1 1 0 0 1 0 0 0 0 1 1 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 1
    # [40] 0 0 0 1 1 0 0 0 1 1 0 0 0 1 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 0 0 1 0 0 0 0 1 1
    # [79] 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 1 0 1
    # [118] 0 0 0 0 1 0 1 0 1 1 1 1 0 0 0 1 0 0 1 1 1 1 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
    # [157] 0 1 0 0 1 1 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0 0
    # Levels: 0 1
    

    update

    To add the predictions as new column to each data frame in the list, modify the code like so:

    res <- lapply(df_list, \(x) cbind(x, pred=predict(multinom_model, newdata=x, type="class")))
    
    lapply(res, head)
    # $rose_1
    #     low age lwt race smoke ptl ht ui ftv  bwt pred
    # 136   0  24 115    1     0   0  0  0   2 3090    0
    # 154   0  26 133    3     1   2  0  0   0 3260    0
    # 34    1  19 112    1     1   0  0  1   0 2084    1
    # 166   0  16 112    2     0   0  0  0   0 3374    0
    # 27    1  20 150    1     1   0  0  0   2 1928    1
    # 218   0  26 160    3     0   0  0  0   0 4054    0
    # 
    # $rose_2
    #     low age lwt race smoke ptl ht ui ftv  bwt pred
    # 167   0  16 135    1     1   0  0  0   0 3374    0
    # 26    1  25  92    1     1   0  0  0   0 1928    1
    # 149   0  23 119    3     0   0  0  0   2 3232    0
    # 98    0  22  95    3     0   0  1  0   0 2751    0
    # 222   0  31 120    1     0   0  0  0   2 4167    0
    # 220   0  22 129    1     0   0  0  0   0 4111    0
    # 
    # $rose_3
    #     low age lwt race smoke ptl ht ui ftv  bwt pred
    # 183   0  36 175    1     0   0  0  0   0 3600    0
    # 86    0  33 155    3     0   0  0  0   3 2551    0
    # 51    1  20 121    1     1   1  0  1   0 2296    1
    # 17    1  23  97    3     0   0  0  1   1 1588    1
    # 78    1  14 101    3     1   1  0  0   0 2466    1
    # 167   0  16 135    1     1   0  0  0   0 3374    0
    

    Data:

    data('birthwt', package='MASS')
    set.seed(42)
    df_list <- replicate(3, birthwt[sample(nrow(birthwt), replace=TRUE), ], simplify=FALSE) |>
      setNames(paste0('rose_', 1:3))