Search code examples
rlinear-regressionlinear-equation

How to perform linear regression for multiple columns and get a dataframe output with: regression equation and r squared value?


My dataframe looks like this

df = structure(list(Date_Time_GMT_3 = structure(c(1625025600, 1625026500,1625027400, 1625028300, 1625029200, 1625030100), 
                                                class = c("POSIXct", "POSIXt"), tzone = "EST"), 
                    X20676887_X2LH_S = c(26.879, 26.781,26.683, 26.585, 26.488, 26.39), 
                    X20819831_11LH_S = c(26.39, 26.292, 26.195, 26.195, 26.097, 26), 
                    X20822214_X4LH_S = c(26.39, 26.292,26.292, 26.195, 26.097, 26), 
                    LH27_20822244_U_Stationary = c(23.388, 23.292, 23.292, 23.196, 23.196, 23.196)), 
               row.names = 2749:2754, class = "data.frame")

and I'm trying to get the linear regression equations and R squared values for all columns where the column with the string "Stationary" in it will always be on the x-axis.

so far I can perform the linear regression for 1 column against the "stationary" column


model = lm(df$LH27_20822244_U_Stationary ~
             df$X20822214_X4LH_S, df)

and when I use

summary(model)

afterwards it gives me some values I would like in a dataframe (i.e R squared, Estimate Std., Std. Error, Pr(>|t|)) but 2 things I need to help with are:

  1. I still need the regression equation for each column that doesn't have stationary in the name
  2. I need these values for each of the columns that don't have stationary in it's name, and I need that to be a dataframe that looks like so...
 Logger_ID        Reg_equation R_Squared Estimate_Std. Std_Error  Pr_t..
  <chr>            <int>               <int>     <int>        <int>     <int>   
1 X20676887_X2LH_S NA                  NA        NA            NA         NA      
2 X20819831_11LH_S NA                  NA        NA            NA         NA      
3 X20822214_X4LH_S NA                  NA        NA            NA         NA      

Solution

  • Something like this:

    library(tidyverse)
    library(broom)
    df1 %>% 
      pivot_longer(
        cols = starts_with("X")
      ) %>% 
      mutate(name = factor(name)) %>% 
      group_by(name) %>% 
      group_split() %>% 
      map_dfr(.f = function(df){
        lm(LH27_20822244_U_Stationary ~ value, data = df) %>% 
          glance() %>% 
          # tidy() %>%  
          add_column(name = unique(df$name), .before=1)
      })
    

    Using tidy()

      name             term        estimate std.error statistic p.value
      <fct>            <chr>          <dbl>     <dbl>     <dbl>   <dbl>
    1 X20676887_X2LH_S (Intercept)   12.8      2.28        5.62 0.00494
    2 X20676887_X2LH_S value          0.393    0.0855      4.59 0.0101 
    3 X20819831_11LH_S (Intercept)   10.4      3.72        2.79 0.0495 
    4 X20819831_11LH_S value          0.492    0.142       3.47 0.0256 
    5 X20822214_X4LH_S (Intercept)   10.5      3.30        3.20 0.0329 
    6 X20822214_X4LH_S value          0.485    0.126       3.86 0.0182 
    

    Using glance()

      name          r.squared adj.r.squared  sigma statistic p.value    df logLik   AIC   BIC deviance df.residual  nobs
      <fct>             <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
    1 X20676887_X2~     0.841         0.801 0.0350      21.1  0.0101     1   12.8 -19.6 -20.3  0.00490           4     6
    2 X20819831_11~     0.751         0.688 0.0438      12.0  0.0256     1   11.5 -17.0 -17.6  0.00766           4     6
    3 X20822214_X4~     0.788         0.735 0.0403      14.9  0.0182     1   12.0 -17.9 -18.6  0.00651           4     6