R glm - How to predict the same coefficient from two different datasets with the same data format and values

So let me explain my goal.

(1) I have an existing glm with P input variables, one of which is named 'X'.

(2) I have multiple datasets from different systems. Each contains the 'X' input variable, but under a different name. Once I extract a dataset from its system, I know the name of the variable that corresponds to 'X'.

(3) I want to use the predict(*) R function on each dataset. Is there a way to do this without renaming the input variable to 'X', for example by supplying a reference column name that predict can read instead of the raw column name? If there is no way to do it, I guess I will need to create a temporary dataset with the input variable renamed to 'X', because I do not want to modify the column names in the original dataset.

(4, extra) I also want to solve the same problem for multiple glms that share the input variable 'X', which again appears under different names.

Thank you

Solution

I'm not sure I understand you correctly, but it sounds as though you want to be able to use predict.glm where the newdata argument contains an independent variable that is named differently to the equivalent independent variable in the data frame that was used to create the glm. However, you don't want to have to rename the column in the new data frame.

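One way to do this is to create a wrapper for predict that takes the name of the variable to be substituted in the new data frame, as well as the name of the model variable it represents. The renaming happens on the function's local copy of the data frame, so the data frame you pass in is left untouched. Let's call it predict2: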

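predict2 <- function(model, newdata, oldvar, newvar, ...)
{
  # Rename only when both variable names are supplied
  if(!missing(oldvar) && !missing(newvar))
  {
    # Capture the unquoted arguments as character strings
    oldname <- deparse(substitute(oldvar))
    newname <- deparse(substitute(newvar))
    # newdata is a local copy here, so the caller's data frame is unchanged
    names(newdata)[names(newdata) == newname] <- oldname
  }
  predict.glm(model, newdata = newdata, ...)
}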

Now let's see how this would work. You haven't given us a reproducible example, but here's a very simple one. It's just a logistic regression on a three-level factor variable called X:

set.seed(69)
df1 <- data.frame(outcome = rbinom(15, 1, rep(c(.1, .5, .9), each = 5)),
                  X = rep(LETTERS[1:3], each = 5))

mod <- glm(outcome ~ X, df1, family = binomial)

predict(mod)
#>           1           2           3           4           5           6 
#> -19.5660685 -19.5660685 -19.5660685 -19.5660685 -19.5660685  -0.4054651 
#>           7           8           9          10          11          12 
#>  -0.4054651  -0.4054651  -0.4054651  -0.4054651   1.3862944   1.3862944 
#>          13          14          15 
#>   1.3862944   1.3862944   1.3862944

Now if we create a new data frame where the factor variable is called Y, which for illustration only contains B's, we have a problem when we try to use our model:

df2 <- data.frame(outcome = rbinom(15, 1, .5), Y = rep('B', 15))

predict(mod, newdata = df2)
#> Error in eval(predvars, data, env): object 'X' not found
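The error arises because predict looks up the model's variables by name in newdata. As a rough sketch, you could check up front which of the model's input variables are absent from a data frame before calling predict. The objects fit and new_df below are toy stand-ins assumed for illustration, not part of the example above:

```r
# Toy model and data, assumed for illustration only
set.seed(1)
train <- data.frame(outcome = rbinom(30, 1, 0.5),
                    X = rep(c("A", "B", "C"), 10))
fit <- glm(outcome ~ X, data = train, family = binomial)

new_df <- data.frame(Y = rep("B", 3))

# Input variables the model expects (response dropped from the terms)
needed <- all.vars(delete.response(terms(fit)))

# Those not present as columns in the new data frame
missing_vars <- setdiff(needed, names(new_df))
missing_vars
#> [1] "X"
```

Anything listed in missing_vars has to be supplied, by renaming or otherwise, before predict can succeed.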

However, with our new function predict2, we just tell it to use Y in place of X, and we get our result (which, since they are all B's, should all be -0.4054651):

predict2(mod, newdata = df2, oldvar = X, newvar = Y)
#>          1          2          3          4          5          6          7 
#> -0.4054651 -0.4054651 -0.4054651 -0.4054651 -0.4054651 -0.4054651 -0.4054651 
#>          8          9         10         11         12         13         14 
#> -0.4054651 -0.4054651 -0.4054651 -0.4054651 -0.4054651 -0.4054651 -0.4054651 
#>         15 
#> -0.4054651
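predict2 takes unquoted names, which is convenient interactively but awkward when the column name is only known at run time and held in a character variable, as the question describes. A possible variant, sketched here under that assumption, takes a named character vector instead. The name predict_mapped, the mapping argument, and the toy data are all hypothetical:

```r
# Hypothetical helper: `mapping` is a named character vector whose names are
# the model's variable names and whose values are the column names in newdata.
predict_mapped <- function(model, newdata, mapping, ...) {
  stopifnot(is.character(mapping), !is.null(names(mapping)))
  missing_cols <- setdiff(mapping, names(newdata))
  if (length(missing_cols) > 0) {
    stop("columns not found in newdata: ", paste(missing_cols, collapse = ", "))
  }
  # Rename in the local copy only; the caller's data frame is not modified
  names(newdata)[match(mapping, names(newdata))] <- names(mapping)
  predict(model, newdata = newdata, ...)
}

# Toy model and data, assumed for illustration only
set.seed(1)
train <- data.frame(X = rnorm(50))
train$outcome <- rbinom(50, 1, plogis(0.5 * train$X))
fit <- glm(outcome ~ X, data = train, family = binomial)

other <- data.frame(temp_reading = c(-1, 0, 1))
x_name <- "temp_reading"  # column name discovered at run time

p_mapped <- predict_mapped(fit, other, mapping = c(X = x_name))
p_direct <- predict(fit, newdata = data.frame(X = other$temp_reading))
```

Here p_mapped and p_direct should agree, and names(other) is still "temp_reading" afterwards. Because mapping is a vector, several columns can be renamed in one call, for example mapping = c(X = "temp_reading", Z = "pressure").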

Created on 2020-05-10 by the reprex package (v0.3.0)
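For point (4) in the question, the same idea might extend to several models and several datasets by looping over both. The following is only a sketch: the model list, the dataset names, and the x_name field recording each dataset's name for 'X' are assumptions made for illustration:

```r
# Toy training data and two models sharing the input variable X (assumed)
set.seed(2)
train <- data.frame(X = rnorm(60), Z = rnorm(60))
train$outcome <- rbinom(60, 1, plogis(train$X - 0.5 * train$Z))

models <- list(
  x_only  = glm(outcome ~ X, data = train, family = binomial),
  x_and_z = glm(outcome ~ X + Z, data = train, family = binomial)
)

# Each dataset is stored with the name it uses for X
datasets <- list(
  sys_a = list(data = data.frame(x_a = c(-1, 0, 1), Z = c(0, 0, 0)),
               x_name = "x_a"),
  sys_b = list(data = data.frame(input1 = c(0.5, 2), Z = c(1, -1)),
               x_name = "input1")
)

preds <- lapply(datasets, function(d) {
  tmp <- d$data                              # temporary copy
  names(tmp)[names(tmp) == d$x_name] <- "X"  # rename in the copy only
  # Predicted probabilities from every model for this dataset
  lapply(models, predict, newdata = tmp, type = "response")
})
```

preds is then a nested list indexed first by dataset and then by model, for example preds$sys_a$x_and_z, and the data frames stored in datasets keep their original column names.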