Search code examples
rfor-loopdplyrpurrrtidyeval

Create formulas from R dataframe of coefficients and variables


I am trying to create a function, formulator, to create R formulas out of a dataframe of responses, coefficients and constants and function names. My intent is to use it when converting large sheets of historical functions into useable R code. It is tedious and error-prone to rewrite each function as (response ~ constant + b1 x x1 + b2 x x2.....)

Example dataframe with same variables, but where not every variable was interesting (e.g. NA when unused) for every case. Every function has its' own row and every part its' own column, where the column name is the variable, and the cell is the coefficient. Not all coefficients are positive.

structure(list(species = c("Pine", "Spruce", "Birch", "Aspen", 
"Beech", "Oak", "Noble", "Trivial"), constant = c(-1.6952, -2.2827, 
-0.2269, -0.8198, 0.2081, 0.2348, 0.485, 1.9814), lndp1 = c(1.1617, 
1.4354, 1.1891, 1.4839, 1.7491, 1.2141, 1.0318, 0.8401), d = c(-0.0354, 
-0.0389, -0.0435, -0.024, -0.2167, NA, NA, NA), d2gt = c(0.2791, 
0.3106, 0.562, NA, NA, NA, NA, NA)), row.names = c(NA, -8L), class = c("tbl_df", 
"tbl", "data.frame"))

My idea was that since it is in a tidy order, I could write a function to do this for me, and reply with a printout like follows:

data %>% formulator(name_column=species, intercept_column=constant, response="Unknown")

In this case, there is no known response variable column, but I might know that all rows in this dataframe have the same response, which could be useful to type in by hand in quotations (tidyeval issue?).

Pine
Unknown ~ -1.6952 + 1.1617 x lndp1 + -0.0354 x d ....

Spruce
Unknown ~ ...

Here's my thinking so far:

formulator <- function(data, name_column, intercept_column){
  data1 <- data %>% select(-c(name_column, intercept_column))
  function_name <- data[,paste0(name_column)]
  intercepts <- data[,paste0(intercept_column)]

  varlist <- list()

  for(i in 1:dim(data1)[1]){
    data2 <- data1 %>% filter(name_column == paste0(function_name$i)) %>%  select_if(~!any(is.na(.)))
    datadim <- dim(data2)[2]
    for(coefs in 1:datadim){
      varlist[paste0(function_name$i)][coefs] <- paste0(data2[1,coefs])

    }
  }


}

This code is incomplete, but I think will manage to handle the varying lengths of each function to print, but I'm unsure of how to tie all this together.


Solution

  • I might suggest creating text versions of your formulas stored as a named vector, then just using as.formula(textVersion["foo"]) any time you needed a formula. Here's some code to give you the idea...

    library(tibble)
    library(dplyr)
    
    formulaData = tibble(
      species = c("Pine", "Spruce", "Birch", "Aspen", "Beech", "Oak", "Noble", "Trivial"), 
      constant = c(-1.6952, -2.2827, -0.2269, -0.8198, 0.2081, 0.2348, 0.485, 1.9814), 
      lndp1 = c(1.1617, 1.4354, 1.1891, 1.4839, 1.7491, 1.2141, 1.0318, 0.8401), 
      d = c(-0.0354, -0.0389, -0.0435, -0.024, -0.2167, NA, NA, NA),
      d2gt = c(0.2791, 0.3106, 0.562, NA, NA, NA, NA, NA)
    )
    
    rhs = 
      formulaData %>%
      select(!constant) %>%
      group_by(species) %>%
      group_map(
        function(x,y) 
          x[,!is.na(as.numeric(x))] %>%
          unlist %>%
          paste(names(.), sep = "*", collapse = " + ")
      ) %>%
      unlist %>%
      paste(" + ", formulaData$constant)
    
    textVersion = 
      paste("x ~", rhs) %>%
      structure(names = sort(formulaData$species))
    

    Example results:

    > textVersion
                                                      Aspen 
                  "x ~ 1.4839*lndp1 + -0.024*d  +  -1.6952" 
                                                      Beech 
                 "x ~ 1.7491*lndp1 + -0.2167*d  +  -2.2827" 
                                                      Birch 
    "x ~ 1.1891*lndp1 + -0.0435*d + 0.562*d2gt  +  -0.2269" 
                                                      Noble 
                             "x ~ 1.0318*lndp1  +  -0.8198" 
                                                        Oak 
                              "x ~ 1.2141*lndp1  +  0.2081" 
                                                       Pine 
    "x ~ 1.1617*lndp1 + -0.0354*d + 0.2791*d2gt  +  0.2348" 
                                                     Spruce 
     "x ~ 1.4354*lndp1 + -0.0389*d + 0.3106*d2gt  +  0.485" 
                                                    Trivial 
                              "x ~ 0.8401*lndp1  +  1.9814" 
    

    and

    > as.formula(textVersion["Oak"])
    x ~ 1.2141 * lndp1 + 0.2081
    

    If you really want a formulator function that returns a formula, I'd transpose your tibble:

    transposedData = 
      formulaData %>%
      select(!species) %>%
      unlist %>%
      matrix(ncol = 4, dimnames = list(formulaData$species, names(formulaData)[-1])) %>%
      t %>%
      as_tibble %>%
      mutate(term = names(formulaData)[-1]) %>%
      relocate(term, before = Pine)
    

    Which looks like this:

    > transposedData
    # A tibble: 4 x 9
      term        Pine  Spruce   Birch  Aspen  Beech    Oak  Noble Trivial
      <chr>      <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>
    1 constant -1.70   -2.28   -0.227  -0.820  0.208  0.235  0.485   1.98 
    2 lndp1     1.16    1.44    1.19    1.48   1.75   1.21   1.03    0.840
    3 d        -0.0354 -0.0389 -0.0435 -0.024 -0.217 NA     NA      NA    
    4 d2gt      0.279   0.311   0.562  NA     NA     NA     NA      NA    
    

    Then the function is pretty simple. Something like:

    formulator = function(.data, ID, lhs, constant = "constant") {
      terms = structure(
        paste(.data[[ID]], .data$term, sep = "*"),
        names = .data$term
      )
      terms = terms[!is.na(.data[[ID]])]
      cnst = which(names(terms) == constant)
      terms[cnst] = .data[[ID]][cnst]
      rhs = paste(terms, collapse = " + ")
      textVersion = paste(lhs, "~", rhs)
      as.formula(textVersion, env = parent.frame())
    }
    

    Here's an example application:

    > formulator(transposedData, "Beech", "myVariable")
    myVariable ~ 0.2081 + 1.7491 * lndp1 + -0.2167 * d
    

    I'm not sure I've completely understood your question or that the function I've written is what you want, but there are some coding examples that might help you design a solution.