r, linear-regression

Creating a linear regression model for each group in a column


I refer to this answer: https://stackoverflow.com/a/65076441/14436230

I am trying to predict the "Education" value for 2019 using past values for each year, using lm(Education ~ poly(TIME,2)).

However, I need to fit this model separately for each "LOCATION". I was able to create a separate lm for each LOCATION and store them in m.

Following the linked answer, my code runs up to my_predict. When I run sapply, I get this error: Error in UseMethod("predict") : no applicable method for 'predict' applied to an object of class "list"

Can someone advise me on my mistake? I will really appreciate any help.



linear_model <- function(TIME) lm(Education ~ poly(TIME,2), data=table2)

m <- lapply(split(table2,table2$LOCATION),linear_model)

new_df <- data.frame(TIME=c(2019))

my_predict <- function(TIME) predict(m,new_df)

sapply(m,my_predict)   #error here 

EDIT:

I am now able to predict education values for each "LOCATION" for 2020 and 2021 as shown below.

linear_model <- function(x) lm(Education ~ TIME, x)
m <- lapply(split(tableLinR,tableLinR$LOCATION),linear_model)
new_df <- data.frame(TIME=c(2020, 2021), row.names = c ("2020.Education", "2021.Education"))
my_predict <- function(x) predict(x,new_df)
result <- sapply(m,my_predict)

[image: predicted Education values for each LOCATION]

However, I actually want to do this for several more variables (e.g. Education, GDP, Hoursworked, PPI etc.), as shown in my column headers:

[image: column headers of the data]

Can someone advise me on how to create a loop for my code that produces a data frame with the predicted values? I have struggled for many hours without success.


Solution

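  • Your function definitions have mistakes: neither function uses its argument. linear_model always fits on the whole of table2, and my_predict passes the whole list m to predict instead of a single model, which is what causes the error. A function is usually written as function(x), where x stands for whatever data you pass in when you call it.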

    For example, in the linear_model function you defined, if you were to use it alone you would write:

    linear_model(data)
    

    However, because you are using it inside lapply, this is a bit harder to see. lapply simply loops over the list of data frames you obtain from split(table2,table2$LOCATION) and applies linear_model to each one.

    The same applies to my_predict: sapply passes each fitted model in m to it, one at a time.

    Anyway, this should work for you:

    linear_model <- function(x) lm(Education ~ TIME, x)
    
    m <- lapply(split(table2,table2$LOCATION),linear_model)
    
    new_df <- data.frame(TIME=c(2019))
    
    my_predict <- function(x) predict(x,new_df)
    
    sapply(m,my_predict)  
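
To see why the original call failed, here is a minimal self-contained sketch. The data frame toy and its values are made up for illustration; only the column names LOCATION, TIME and Education follow the question. Calling predict on the whole list m reproduces the error, while calling it on each element of m works.

```r
# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  LOCATION  = rep(c("AUS", "AUT"), each = 4),
  TIME      = rep(2015:2018, times = 2),
  Education = c(10, 12, 14, 16, 20, 19, 18, 17)
)

# One data frame per LOCATION, then one lm per data frame
linear_model <- function(x) lm(Education ~ TIME, data = x)
m <- lapply(split(toy, toy$LOCATION), linear_model)

new_df <- data.frame(TIME = 2019)

# predict() has no method for a plain list, so this call fails;
# tryCatch() stores the error message instead of stopping the script
err <- tryCatch(predict(m, new_df), error = function(e) conditionMessage(e))

# Applying predict() to each fitted model in turn works
result <- sapply(m, function(fit) predict(fit, new_df))
```

Here err holds the error message and result holds one prediction per LOCATION.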
    

    ANSWER TO THE EDIT

    There are probably more efficient ways of looping the prediction, but here is my approach:

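    library(reshape2)  # melt(); assumed to be reshape2's, which adds the L1 column
    library(tidyr)     # pivot_wider()

    pred_data <- list()

    for (i in 3:6) {
      # Fit column i against TIME, separately for each LOCATION
      linear_model <- function(x) lm(x[,i] ~ TIME, x)
      m <- lapply(split(tableLinR,tableLinR$LOCATION),linear_model)
      new_df <- data.frame(TIME=c(2020, 2021), row.names = c("2020", "2021"))
      my_predict <- function(x) predict(x,new_df)
      # Store the predictions under the name of column i
      pred_data[[colnames(tableLinR)[i]]] <- sapply(m,my_predict)
    }

    # Flatten the list, then spread it to one column per variable
    pred_data <- melt(pred_data)
    pred_data <- as.data.frame(pivot_wider(pred_data, names_from = L1, values_from = value))
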
    

    First you create an empty list where you will save the outputs of your loop. In for (i in 3:6) you put the range of columns you want predictions for. The result pred_data is a list that you can transform into a data frame in different ways. With melt and pivot_wider you obtain a format similar to your original data.
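
As a possible alternative to the loop, lm in base R accepts a matrix response built with cbind, so one fit per LOCATION can cover several columns at once. The sketch below is not the method used above; the data frame toy, its values, the vector targets and the helper predict_location are all hypothetical names introduced for illustration.

```r
# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  LOCATION  = rep(c("AUS", "AUT"), each = 4),
  TIME      = rep(2015:2018, times = 2),
  Education = c(10, 12, 14, 16, 20, 19, 18, 17),
  GDP       = c(100, 110, 120, 130, 200, 205, 210, 215)
)

targets <- c("Education", "GDP")             # columns to predict
new_df  <- data.frame(TIME = c(2020, 2021))  # years to predict for

predict_location <- function(x) {
  # Build the formula cbind(Education, GDP) ~ TIME from the column names
  f <- as.formula(paste0("cbind(", paste(targets, collapse = ", "), ") ~ TIME"))
  fit <- lm(f, data = x)  # a single multi-response fit for this LOCATION
  # predict() returns a matrix with one column per target
  data.frame(LOCATION = x$LOCATION[1], TIME = new_df$TIME, predict(fit, new_df))
}

# One block of predictions per LOCATION, stacked into a single data frame
pred <- do.call(rbind, lapply(split(toy, toy$LOCATION), predict_location))
rownames(pred) <- NULL
```

The result pred has one row per LOCATION and year and one column per predicted variable, so no melt or pivot_wider step is needed.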