Tags: r, statistics, linear-regression

olsrr package trouble with many variables in R


I am trying to analyze all the variables in my data set to see which subset of variables describes my dependent variable StockPrice best. This is the code I use:

 install.packages("olsrr")
 library(olsrr)

model <- lm(StockPrice ~ ESGscore + MarketValue + ibc + ni + CommonEquity + AssetsTotal + ROA + ROE + MarketToBook + TobinQ + Liabilities + stock_ret_yr_0 + stock_ret_yr_minus1 + stock_ret_yr_plus1 + EPS + BookValuePS, data = Datensatz_Excel)
ols_step_best_subset(model)
A <- ols_step_best_subset(model)
plot(A)

And here is some data to reproduce it:

Please let me know if this works for you; I'm doing this for the first time. If there is a better way to provide some data (e.g., clearly arranged) other than using dput(), please let me know! :)

Datensatz_Excel <- structure(list(Company = c("AIR PRODUCTS & CHEMICALS INC", "AIR PRODUCTS & CHEMICALS INC", 
"AIR PRODUCTS & CHEMICALS INC", "AIR PRODUCTS & CHEMICALS INC", 
"AIR PRODUCTS & CHEMICALS INC", "AIR PRODUCTS & CHEMICALS INC", 
"AIR PRODUCTS & CHEMICALS INC"), Year = c(2011, 2012, 2013, 2014, 
2015, 2016, 2017), gvkey = c(1209, 1209, 1209, 1209, 1209, 1209, 
1209), ggroup = c(1510, 1510, 1510, 1510, 1510, 1510, 1510), 
    ESGscore = c(84.2750015258789, 81.9225006103516, 77.4024963378906, 
    80.1125030517578, 78.6449966430664, 76.3775024414062, 79.2699966430664
    ), MarketValue = c(17934.369140625, 17537.578125, 23639.79296875, 
    30868.392578125, 28037.404296875, 31271.359375, 35903.4921875
    ), ibc = c(1252.59997558594, 1025.19995117188, 1042.5, 988.5, 
    1317.59997558594, 1545.69995117188, 1155.19995117188), ni = c(1224.19995117188, 
    1167.30004882812, 994.200012207031, 991.700012207031, 1277.90002441406, 
    631.099975585938, 3000.39990234375), CommonEquity = c(5795.7998046875, 
    6477.2001953125, 7042.10009765625, 7365.7998046875, 7249, 
    7079.60009765625, 10086.2001953125), AssetsTotal = c(14290.7001953125, 
    16941.80078125, 17850.099609375, 17779.099609375, 17438.099609375, 
    18055.30078125, 18467.19921875), ROA = c(0.0906418636441231, 
    0.0816824957728386, 0.0586832538247108, 0.0555571131408215, 
    0.0718765333294868, 0.0361908674240112, 0.166178345680237
    ), ROE = c(0.220699846744537, 0.201404482126236, 0.153492242097855, 
    0.140824466943741, 0.17349100112915, 0.0870602801442146, 
    0.423809230327606), MarketToBook = c(3.09437346458435, 2.70758628845215, 
    3.35692381858826, 4.19077253341675, 3.86776161193848, 4.41710805892944, 
    3.55966472625732), TobinQ = c(1.84940338134766, 1.65284550189972, 
    1.92983758449554, 2.32192254066467, 2.19212555885315, 2.33987021446228, 
    2.39800786972046), Liabilities = c(8494.900390625, 10464.6005859375, 
    10807.9995117188, 10413.2998046875, 10189.099609375, 10975.7006835938, 
    8380.9990234375), StockPrice = c(85.19, 84.02, 111.78, 144.23, 
    130.11, 143.82, 164.08), stock_ret_yr_0 = c(-0.0378783643245697, 
    0.0164456591010094, 0.369286864995956, 0.321167588233948, 
    -0.076192781329155, 0.252576589584351, 0.170138001441956), 
    stock_ret_yr_minus1 = c(0.150884702801704, -0.0378783643245697, 
    0.0164456591010094, 0.369286864995956, 0.321167588233948, 
    -0.076192781329155, 0.252576589584351), stock_ret_yr_plus1 = c(0.0164456591010094, 
    0.369286864995956, 0.321167588233948, -0.076192781329155, 
    0.252576589584351, 0.170138001441956, 0.00247942004352808
    ), EPS = c(5.75, 5.53, 4.74, 4.66, 5.95, 2.92, 13.76), BookValuePS = c(27.21, 
    30.67, 33.58, 34.63, 33.73, 32.72, 46.27)), row.names = c(NA, 
-7L), class = c("tbl_df", "tbl", "data.frame"))

The problem is that whenever R has to analyze all 16 variables, the program just does not respond. R shows the code in the console (the lower box) and puts "model" in the environment (data) pane, but nothing happens after that. There is no error message or anything like that. I have also tried waiting for 15 minutes, but nothing happened.

If I just analyze 4-5 variables there is no problem at all.

Is anyone out there having the same problem, and maybe a solution? :)

Happy New Year everybody and thanks for the help :)


Solution

  • BLUF, or the more trendy TL;DR: I think the function ols_step_best_subset() has limitations. Best-subset selection is by definition exhaustive, so with 16 predictors it has to fit 2^16 - 1 = 65,535 candidate models, which is likely why it appears to hang. However, there are other ways to get what you are looking for.
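    A quick back-of-the-envelope check in base R shows why: an exhaustive best-subset search over k predictors has to fit every non-empty subset of them, i.e. 2^k - 1 models.

    k <- c(5, 10, 16)
    data.frame(predictors = k, models = 2^k - 1)
    #   predictors models
    # 1          5     31
    # 2         10   1023
    # 3         16  65535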

    The long version:

    Okay, I worked with the data you provided, but I did not run into any of the issues you ran into. I thought it may have been due to how few rows of data you had provided. (You provided a lot of information! There just isn't a lot for the model to work with.)

    Instead of asking you for more data, since it seemed to be more of a question on the limitations of R, I found a built-in wide dataset. I still didn't run into the issues you ran into.

    I used the data dhfr from the package caret. It has over 200 variables; I randomly chose 24.

    I didn't clean it. I did look at the variance inflation factors, and the overly influential variables are likely going to be an issue; I did this as a way to look for multicollinearity, which is a really big problem for linear regression. Since vif() ran without error (so no perfect collinearity), I used this data.

    library(tidyverse)
    library(olsrr)
    library(caret) 
    library(randomForest)
    library(car)
    
    #------------------- Collect and Clean -------------------
    data(dhfr, package = "caret")
    
    # arbitrarily chose columns
    dhfr2 <- dhfr[, 2:25]
    
    # just checking what's there to work with
    summary(dhfr2)
    
    # check for multicollinearity and overly influential variables
    vif(lm(moeGao_Abra_L ~ ., data = dhfr2))
    # if this errors, perfect multicollinearity exists
    #   (multiple variables carrying the same information)
    # high VIF values are likely a source of big problems in lm();
    #   there are several very high values here
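    # one possible follow-up (kept commented out, since the analysis
    # below leaves all variables in): drop predictors above a
    # conventional VIF cutoff, e.g. 10, before modeling
    # v <- vif(lm(moeGao_Abra_L ~ ., data = dhfr2))
    # high_vif <- names(v)[v > 10]
    # dhfr2_trim <- dhfr2[, setdiff(names(dhfr2), high_vif)]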
    
    # data isn't clean or explored for actual analysis, 
    # but good enough to answer your inquiry
    
    #----------------- Prepare for Modeling ------------------
    set.seed(3926)
    # partition for training and testing
    tr <- createDataPartition(dhfr2[, 1], p = .7, list = FALSE)
    
    #---------------- Linear Regression Model ----------------
    # use all remaining variables in the dataset (23 predictors)
    fit.lm <- lm(moeGao_Abra_L ~ ., data = dhfr2[tr, ])
    summary(fit.lm)
    

    The model has an explained variance (R²) of .9629.

    p <- predict(fit.lm, dhfr2[-tr, -1])
    postResample(p, dhfr2[-tr, 1])
    #      RMSE  Rsquared       MAE 
    # 0.2416366 0.9416441 0.1744155  
    # potentially an issue with overfitting
       # assumptions not assessed 
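    # a quick sketch of the standard base-R checks for those assumptions:
    # residuals vs. fitted, normal QQ, scale-location, leverage
    oldpar <- par(mfrow = c(2, 2))
    plot(fit.lm)
    par(oldpar)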
    

    If you want to assess a model this large, instead of using ols_step_best_subset() you could use rfe() from caret.

    You would have to create the lm model through caret.

    set.seed(3926)
    fit.lmT <- train(moeGao_Abra_L~., data = dhfr2[tr, ],
                     method = "lm")
    

    You have to set up a controller first, but this really just depends on what type of model you are using. For lm, you just need to know about lmFuncs, caret's set of helper functions for linear models.

    This is going to use cross-validation.

    ctrl <- rfeControl(functions = lmFuncs,
                       method = "repeatedcv",
                       repeats = 5,
                       verbose = FALSE)
    

    Then you can apply rfe().

    lmP <- rfe(x = dhfr2[tr, -1],
               y = dhfr2[tr, 1],
               sizes = 4:10,  
               rfeControl = ctrl)
    

    In the call to rfe(), the sizes argument is important. From the help for this function: "a numeric vector of integers corresponding to the number of features that should be retained." This looked for the best-performing group of four, group of five, and so on up to a group of ten for the outcome variable. There is a lot more you can control here as well.
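    Once rfe() has run, you can query the fitted object directly; predictors() and the optVariables element are part of caret's rfe interface:

    predictors(lmP)                # the winning subset (same as lmP$optVariables)
    plot(lmP, type = c("g", "o"))  # performance profile across subset sizes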

    You can read all the details about caret in its online documentation at https://topepo.github.io/caret/.

    The results of rfe():

    # Recursive feature selection
    # 
    # Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 
    # 
    # Resampling performance over subset size:
    # 
    #  Variables   RMSE Rsquared    MAE  RMSESD RsquaredSD   MAESD Selected
    #          4 0.2814   0.9062 0.2243 0.05158    0.04136 0.04234         
    #          5 0.2788   0.9079 0.2192 0.04495    0.03832 0.03776         
    #          6 0.2743   0.9106 0.2149 0.04504    0.03794 0.03795         
    #          7 0.2670   0.9137 0.2093 0.04296    0.03875 0.03498         
    #          8 0.2542   0.9205 0.2008 0.04442    0.03920 0.03437         
    #          9 0.2335   0.9331 0.1851 0.04086    0.03124 0.03231         
    #         10 0.2203   0.9402 0.1754 0.03177    0.02793 0.02591         
    #         23 0.2034   0.9489 0.1618 0.03034    0.02410 0.02428        *
    # 
    # The top 5 variables (out of 23):
    #    moeGao_Abra_R, moe2D_GCUT_PEOE_3, moeGao_Abra_acidity, 
    #    moe2D_BCUT_SMR_0, moe2D_BCUT_SMR_3