
Automatically generate mean-adjusted data for regression in R


I am running into a bit of a roadblock with programming a data-generating function for regression predictions. The usual way to do what I am trying to do (by hand, without the automation I am after) is the following:

#### Fit Data ####
fit <- lm(Petal.Length ~ Petal.Width + Sepal.Width,iris)

#### Create Test Data ####
newdata <- data.frame(
  Petal.Width = mean(iris$Petal.Width),
  Sepal.Width = seq(
    min(iris$Sepal.Width),
    max(iris$Sepal.Width),
    length.out = 100
  )
)

#### Generate Predictions ####
pred <- predict(fit,newdata=newdata)
pred

The idea is to pick one variable of interest, vary it across its observed range, hold the other predictors at their means, and then predict from that data. This gives the following predicted values:

       1        2        3        4        5        6        7        8 
4.133390 4.124783 4.116176 4.107569 4.098962 4.090355 4.081749 4.073142 
       9       10       11       12       13       14       15       16 
4.064535 4.055928 4.047321 4.038714 4.030107 4.021500 4.012893 4.004286 
      17       18       19       20       21       22       23       24 
3.995680 3.987073 3.978466 3.969859 3.961252 3.952645 3.944038 3.935431 
      25       26       27       28       29       30       31       32 
3.926824 3.918217 3.909611 3.901004 3.892397 3.883790 3.875183 3.866576 
      33       34       35       36       37       38       39       40 
3.857969 3.849362 3.840755 3.832148 3.823542 3.814935 3.806328 3.797721 
      41       42       43       44       45       46       47       48 
3.789114 3.780507 3.771900 3.763293 3.754686 3.746079 3.737473 3.728866 
      49       50       51       52       53       54       55       56 
3.720259 3.711652 3.703045 3.694438 3.685831 3.677224 3.668617 3.660010 
      57       58       59       60       61       62       63       64 
3.651404 3.642797 3.634190 3.625583 3.616976 3.608369 3.599762 3.591155 
      65       66       67       68       69       70       71       72 
3.582548 3.573941 3.565335 3.556728 3.548121 3.539514 3.530907 3.522300 
      73       74       75       76       77       78       79       80 
3.513693 3.505086 3.496479 3.487872 3.479266 3.470659 3.462052 3.453445 
      81       82       83       84       85       86       87       88 
3.444838 3.436231 3.427624 3.419017 3.410410 3.401803 3.393197 3.384590 
      89       90       91       92       93       94       95       96 
3.375983 3.367376 3.358769 3.350162 3.341555 3.332948 3.324341 3.315734 
      97       98       99      100 
3.307128 3.298521 3.289914 3.281307

However, I will probably have to do this over and over again, and coding all of it by hand every time isn't efficient, so I am looking to automate it with a custom function.

Test Case

So far, this is what I have come up with to automate the process, but it does not work. The function should set all but one of the variables to their means and turn the remaining variable into a sequence from its min to its max, like what I have above. The generated data should also keep the names of the predictors passed in (so the columns should be named "test1" and so on):

#### Create Test Data ####
test.data <- data.frame(
  test1 = rnorm(100),
  test2 = rnorm(100),
  test3 = rnorm(100),
  test4 = rnorm(100)
)

#### Make Function ####
gen.seq <- function(data,x1,x2,x3,x4){
  
  data <- data
  
  newdata <- data.frame(
    x1 = mean(data$x1, na.rm = T),
    x2 = mean(data$x2, na.rm = T),
    x3 = mean(data$x3, na.rm = T),
    x4 = seq(
      min(data$x4, na.rm = T),
      max(data$x4, na.rm = T),
      length.out = 100
    )
  )
}

#### Generate Mean Controlled Data ####
gen.seq(test.data,
        test1,
        test2,
        test3,
        test4)

I would also like the function to call predict itself if possible, but there is no point attempting that until the data generation step works. How do I accomplish this?


Solution

  • A more general, model-agnostic answer that only creates the data frames:

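    reps <- 3  # sequence length
    cols <- c("test1", "test2", "test4")  # columns to vary, one data frame each
    # one-row data frame holding the mean of every column
    test.data.mean <- as.data.frame.list(colMeans(test.data))
    
    sapply(
      cols,
      function(x) {
        # names of the columns that are held at their mean
        y <- names(test.data.mean)[names(test.data.mean) != x]
        # sequence from min to max of the varying column, named after it
        z <- setNames(data.frame(seq(min(test.data[x]), max(test.data[x]), length.out = reps)), x)
        # fill in the means (recycled across rows), then restore the column order
        z[y] <- test.data.mean[y]
        z[colnames(test.data.mean)]
      },
      simplify = FALSE,
      USE.NAMES = TRUE
    )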
    

    resulting in (the exact values depend on the random draw, since no seed was set)

    $test1
           test1       test2       test3       test4
    1 -1.9394516 -0.03640007 -0.04115825 -0.07265569
    2  0.1961531 -0.03640007 -0.04115825 -0.07265569
    3  2.3317578 -0.03640007 -0.04115825 -0.07265569
    
    $test2
            test1       test2       test3       test4
    1 -0.05502075 -2.66943429 -0.04115825 -0.07265569
    2 -0.05502075 -0.02634115 -0.04115825 -0.07265569
    3 -0.05502075  2.61675199 -0.04115825 -0.07265569
    
    $test4
            test1       test2       test3       test4
    1 -0.05502075 -0.03640007 -0.04115825 -2.60890222
    2 -0.05502075 -0.03640007 -0.04115825  0.01795227
    3 -0.05502075 -0.03640007 -0.04115825  2.64480676
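
The question also asks for the predict step to be wrapped into the function. Below is a minimal sketch of one way that could look, not a definitive implementation. The names `gen_seq` and `pred_seq` are hypothetical, and the code assumes all predictors are numeric and that the model formula uses plain column names (no factors, no transformations such as `log(x)`), so that `model.frame()` returns the raw predictor columns.

```r
# Hypothetical helper: vary one column over its observed range and
# hold every other column at its mean. Assumes all columns are numeric.
gen_seq <- function(data, vary, n = 100) {
  stopifnot(is.data.frame(data), vary %in% names(data))
  # one-row data frame of column means, keeping the original names
  means <- as.data.frame(as.list(colMeans(data, na.rm = TRUE)))
  # repeat the row of means n times, then overwrite the varying column
  out <- means[rep(1, n), , drop = FALSE]
  out[[vary]] <- seq(min(data[[vary]], na.rm = TRUE),
                     max(data[[vary]], na.rm = TRUE),
                     length.out = n)
  rownames(out) <- NULL
  out
}

# Hypothetical wrapper: build the grid from the data the model was fit on,
# then attach the predictions as a column named "fit".
pred_seq <- function(fit, vary, n = 100) {
  # model.frame() returns the response first, then the predictors
  predictors <- model.frame(fit)[, -1, drop = FALSE]
  newdata <- gen_seq(predictors, vary, n)
  cbind(newdata, fit = unname(predict(fit, newdata = newdata)))
}

# Usage with the model from the question
fit <- lm(Petal.Length ~ Petal.Width + Sepal.Width, data = iris)
res <- pred_seq(fit, "Sepal.Width")
head(res)
```

Passing the column name as a string avoids the problem in the original attempt, where `data$x1` looks for a column literally called `x1` rather than the column named by the argument.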