
Is there a method for iterating data frame variables in a formula object?


In my case, I want to fit several glm and lda models on a subset of variables. The response (Y) is the same in every model, but the predictors are added one at a time, forward-selection style, in the order of significance found by a random forest analysis.

However, I can't find a way to iterate over the columns. This is what I tried:

# ordered_df_train is the data frame with its columns ordered by the method
# mentioned above; the first column, crim, is the output
list_formula <- vector(mode = "list", length = 13)
list_formula[[1]] <- ordered_df_train$crim ~ ordered_df_train$age
for(j in 3:14){
  list_formula[[j-1]] <- ordered_df_train$colnames(ordered_df_train)[j]
} 

However, evaluating

ordered_df_train$colnames(ordered_df_train)[j]

reports NULL, so the expected variable is not picked up.

Edit: As suggested, the data used above can be reproduced as follows:

library(MASS)
df_train <- Boston
ordered_df_train <- data.frame(
    crim = df_train$crim,
    age = df_train$age,
    nox = df_train$nox,
    tax = df_train$tax,
    indus = df_train$indus,
    dis = df_train$dis,
    rad = df_train$rad,
    black = df_train$black,
    rm = df_train$rm,
    lstat = df_train$lstat,
    zn = df_train$zn,
    ptratio = df_train$ptratio,
    medv = df_train$medv,
    chas = df_train$chas
)

I hope this makes my question reproducible. The objective is to build a list of formulas following the forward method for best subset selection, adding the next most significant variable at each iteration.


Solution

  • Currently, you are not calling colnames properly. It is a base R function, not an element of the data frame, so it cannot be accessed with $. Even with the right column name in hand, you still need to convert the string to a formula, for example with as.formula.

    Also, consider using lapply to avoid the bookkeeping of initializing a list and then assigning elements by index. Use [-1] to drop the first column name (the response, crim).

    list_formula <- lapply(
      colnames(ordered_df_train)[-1],
      function(col) as.formula(
        paste0("ordered_df_train$crim ~ ordered_df_train$", col)
      )
    )
    
    list_formula
    # [[1]]
    # ordered_df_train$crim ~ ordered_df_train$age
    # <environment: 0x000002842a33f240>
    #   
    # [[2]]
    # ordered_df_train$crim ~ ordered_df_train$nox
    # <environment: 0x000002842a32c270>
    #   
    # [[3]]
    # ordered_df_train$crim ~ ordered_df_train$tax
    # <environment: 0x000002843931fd10>
    #   
    # [[4]]
    # ordered_df_train$crim ~ ordered_df_train$indus
    # <environment: 0x00000284365dc340>
    #   
    # [[5]]
    # ordered_df_train$crim ~ ordered_df_train$dis
    # <environment: 0x00000284379d9800>
    #   
    # [[6]]
    # ordered_df_train$crim ~ ordered_df_train$rad
    # <environment: 0x00000284379d7fb8>
    #   
    # [[7]]
    # ordered_df_train$crim ~ ordered_df_train$black
    # <environment: 0x00000284393cf6e0>
    #   
    # [[8]]
    # ordered_df_train$crim ~ ordered_df_train$rm
    # <environment: 0x00000284379ef078>
    #   
    # [[9]]
    # ordered_df_train$crim ~ ordered_df_train$lstat
    # <environment: 0x000002843959d320>
    #   
    # [[10]]
    # ordered_df_train$crim ~ ordered_df_train$zn
    # <environment: 0x000002843959bad8>
    #   
    # [[11]]
    # ordered_df_train$crim ~ ordered_df_train$ptratio
    # <environment: 0x00000284393e4ba8>
    #   
    # [[12]]
    # ordered_df_train$crim ~ ordered_df_train$medv
    # <environment: 0x00000284366e3348>
    #   
    # [[13]]
    # ordered_df_train$crim ~ ordered_df_train$chas
    # <environment: 0x00000284364db798>
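
To illustrate why the original attempt did not pick up the column, here is a minimal sketch on a small made-up data frame (df and its values are hypothetical, not taken from the question). The $ operator takes a literal name, so df$colnames looks for a column called "colnames"; to select a column whose name is stored in a variable, [[ ]] can be used instead.

```r
# Toy data frame (hypothetical values, for illustration only)
df <- data.frame(crim = c(0.1, 0.2), age = c(65, 78))
j <- 2

# `$` looks for a column literally named "colnames"; none exists, so NULL
is.null(df$colnames)
# [1] TRUE

# colnames() is a function: call it on the data frame, then index the result
col <- colnames(df)[j]
col
# [1] "age"

# `[[` accepts a column name held in a variable
df[[col]]
# [1] 65 78
```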
    

    Also consider reformulate, which builds the formula without as.formula + paste. The formulas below do not include the data frame qualifier, but you can usually pass the data frame to the data argument of your modeling function.

    list_formula <- lapply(
      colnames(ordered_df_train)[-1], function(col) reformulate(col, "crim")
    )
    
    list_formula
    # [[1]]
    # crim ~ age
    # <environment: 0x000002843a203a18>
    #   
    # [[2]]
    # crim ~ nox
    # <environment: 0x000002843a20ad68>
    #   
    # [[3]]
    # crim ~ tax
    # <environment: 0x000002843a274678>
    #   
    # [[4]]
    # crim ~ indus
    # <environment: 0x000002843a279b18>
    #   
    # [[5]]
    # crim ~ dis
    # <environment: 0x000002843a282de8>
    #   
    # [[6]]
    # crim ~ rad
    # <environment: 0x000002843a286368>
    #   
    # [[7]]
    # crim ~ black
    # <environment: 0x000002843a2898e8>
    #   
    # [[8]]
    # crim ~ rm
    # <environment: 0x000002843a28ed88>
    #   
    # [[9]]
    # crim ~ lstat
    # <environment: 0x000002843a296138>
    #   
    # [[10]]
    # crim ~ zn
    # <environment: 0x000002843a2996b8>
    #   
    # [[11]]
    # crim ~ ptratio
    # <environment: 0x000002843a29eb58>
    #   
    # [[12]]
    # crim ~ medv
    # <environment: 0x000002843a2a5f08>
    #   
    # [[13]]
    # crim ~ chas
    # <environment: 0x000002843a2a9488>
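
The formulas above each contain a single predictor. Since the question describes a forward scheme in which each step adds the next variable, the following is a rough sketch of one way to build cumulative formulas with reformulate and fit them through the data argument. The names predictors and fits are my own, the predictor order is copied from the question's data frame, and glm with its default gaussian family is only an assumption about the intended model.

```r
library(MASS)  # provides the Boston data used in the question

# Predictors in the order given in the question (after the response, crim)
predictors <- c("age", "nox", "tax", "indus", "dis", "rad", "black",
                "rm", "lstat", "zn", "ptratio", "medv", "chas")

# The k-th formula uses the first k predictors
list_formula <- lapply(
  seq_along(predictors),
  function(k) reformulate(predictors[seq_len(k)], response = "crim")
)

deparse(list_formula[[3]])
# [1] "crim ~ age + nox + tax"

# Fit one model per formula; the data frame is supplied via `data`
# (default gaussian family is an assumption, swap in your own model call)
fits <- lapply(list_formula, function(f) glm(f, data = Boston))
```

Because the formulas carry bare column names, the same list can be reused with any modeling function that accepts a data argument, such as lda from MASS, provided the response is suitable for that model.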