Why is the regression curve not (displayed) linear(ly)?


I have this data

  Resistance CO_part_l H2_part_l C2H2_part_l rH T_amb
1   7.334982  44.59499  2.33e+19    6.95e+17 36    25
2   7.192182  44.59499  2.33e+19    6.95e+17 36    25
3   7.548556  44.59499  2.33e+19    6.95e+17 36    25
4   7.287561  44.59499  2.33e+19    6.95e+17 36    25
5   5.476464  44.59499  2.33e+19    6.95e+17 36    25
6   5.433722  44.59499  2.33e+19    6.95e+17 36    25

and I want to fit this model:

m4 <- lm(Resistance ~ CO_part_l + H2_part_l + C2H2_part_l + rH + T_amb, data = df)

then predict the values via

pred_df <- data.frame(R_pred = predict(m4, df), CO_part_l = df$CO_part_l)

and finally plot them:

library(ggplot2)
library(latex2exp)  # provides TeX()

ggplot(df, aes(x = exp(CO_part_l), y = exp(Resistance))) + 
  geom_point(color = "blue", size = 3, alpha = 0.4) +
  geom_line(color = 'red', data = pred_df, aes(x = exp(CO_part_l), y = exp(R_pred)), alpha = 0.5, size = 1.15) +
  theme_bw() + xlab(TeX("CO / [part/l]")) + ylab(TeX("R / $ \\Omega $ ")) + labs(title = "CO")

but I don't understand why the regression line looks like pieces of linear functions connected to each other.


Note: Resistance and CO_part_l are log-transformed in the dataset because the relationship is logarithmic, and I have to transform them in advance in order to center the data. That's why I exponentiate them in the plot.

You can find the entire data here https://workupload.com/file/WuwqNeyKnAk I used the dput output, so I hope you can read it in.


Solution

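  • The line looks jagged because each fitted value depends on all five regressors, not only on CO_part_l: predict(m4, df) returns one prediction per observation, and geom_line simply connects those points in order of x. If you want a single smooth line through the plot, you can hold the other covariates fixed (at their means, for example) while changing only the variable plotted on your x axis. In your case, the code to produce the prediction set might look something like this: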

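    # Columns 3:6 are H2_part_l, C2H2_part_l, rH and T_amb; each is held at its mean.
    # CO_part_l is then set to each value of an even grid covering its range.
    pred_df <- do.call(rbind, lapply(seq(40, 45.2, 0.1), function(x)
      within(as.data.frame(t(colMeans(df)[3:6])), CO_part_l <- x)
    ))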
    

    Now pred_df is a data frame in which all your regressors are held at their means, apart from CO_part_l, which varies evenly across its range. We can use this to see how the output variable changes with CO_part_l when all else is equal:

    pred_df$R_pred <- predict(m4, newdata = pred_df)
    

    And that means your plot will look like this:

    ggplot(df, aes(x = exp(CO_part_l), y = exp(Resistance))) + 
      geom_point(color = "blue", size = 3, alpha = 0.4) +
      geom_line(color = 'red',data = pred_df, 
                aes(x = exp(CO_part_l), y = exp(R_pred)), 
                alpha = 0.5, size = 1.15) +
      theme_bw() + xlab(TeX("CO / [part/l]")) + 
      ylab(TeX("R / $ \\Omega $ ")) + 
      labs(title="CO")
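
To see the effect in isolation, here is a minimal sketch on synthetic data (the data frame `toy` and the variables `x1`, `x2`, `y` are made up for illustration and are not the asker's dataset). It compares in-sample predictions, which depend on a second regressor that changes from row to row, with predictions on a grid where that regressor is fixed at its mean.

```r
# Synthetic data (hypothetical, for illustration only)
set.seed(1)
n  <- 60
x1 <- runif(n, 40, 45)   # plays the role of CO_part_l
x2 <- rnorm(n)           # plays the role of another regressor, e.g. rH
y  <- 1 + 0.5 * x1 - 2 * x2 + rnorm(n, sd = 0.1)
toy <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = toy)

# In-sample predictions: x2 differs between rows, so the fitted values do not
# lie on one line in x1, and geom_line() would join them in a zigzag.
in_sample <- predict(fit, newdata = toy)
spread_around_line <- sd(resid(lm(in_sample ~ toy$x1)))

# Grid predictions: x2 is fixed at its mean, only x1 varies.
grid <- data.frame(x1 = seq(40, 45, by = 0.1), x2 = mean(toy$x2))
grid$y_pred <- predict(fit, newdata = grid)

# On the grid, consecutive predictions should rise by a constant step:
# the x1 coefficient times the grid spacing.
steps <- unname(diff(grid$y_pred))
constant_step <- all.equal(steps, rep(unname(coef(fit)["x1"]) * 0.1, length(steps)))
```

Plotting `grid$y_pred` against `grid$x1` gives a single straight line on the model's own scale, whereas `in_sample` against `toy$x1` scatters around it by roughly the contribution of `x2`.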
    


    The fit probably looks more convincing on a log scale (or without exponentiating your y axis at all; I'm not sure of the physical relevance of the numbers, so I'll simply add a log scale here):

    ggplot(df, aes(x = exp(CO_part_l), y = exp(Resistance))) + 
      geom_point(color = "blue", size = 3, alpha = 0.4) +
      geom_line(color = 'red',data = pred_df, 
                aes(x = exp(CO_part_l), y = exp(R_pred)), 
                alpha = 0.5, size = 1.15) +
      theme_bw() + 
      xlab(TeX("CO / [part/l]")) + 
      ylab(TeX("R / $ \\Omega $ ")) + 
      labs(title="CO") +
      scale_y_log10()
    


    And of course, also putting the x axis on a log scale with scale_x_log10 gives a straight line, though not quite as nice a plot:

    ggplot(df, aes(x = exp(CO_part_l), y = exp(Resistance))) + 
      geom_point(color = "blue", size = 3, alpha = 0.4) +
      geom_line(color = 'red',data = pred_df, 
                aes(x = exp(CO_part_l), y = exp(R_pred)), 
                alpha = 0.5, size = 1.15) +
      theme_bw() + 
      xlab(TeX("CO / [part/l]")) + 
      ylab(TeX("R / $ \\Omega $ ")) + 
      labs(title="CO") +
      scale_y_log10() +
      scale_x_log10()
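
The reason the log-log version is straight: both variables are stored as logs, so with the other covariates fixed the model says log R = b0 + b1 * log CO, which is a power law R = exp(b0) * CO^b1 after exponentiating. The sketch below checks this numerically; the values of `b0` and `b1` are hypothetical placeholders, not estimates from the asker's data.

```r
# Hypothetical coefficients, for illustration only
b0 <- 10      # intercept plus the contribution of the fixed covariates
b1 <- -0.2    # coefficient on the log-transformed CO variable

log_co <- seq(40, 45.2, by = 0.1)   # same grid as pred_df in the answer
log_r  <- b0 + b1 * log_co          # prediction on the model's (log) scale

co <- exp(log_co)                   # back-transformed x, as plotted
r  <- exp(log_r)                    # back-transformed y, as plotted

# Slope between consecutive points under each choice of axes
slope_semilog <- diff(log10(r)) / diff(co)          # scale_y_log10() only
slope_loglog  <- diff(log10(r)) / diff(log10(co))   # both axes on log scales

range(slope_loglog)    # essentially constant: a straight line with slope b1
range(slope_semilog)   # changes along the curve: still bent
```

With only the y axis on a log scale the curve flattens as CO grows, because the slope is proportional to 1/CO; only when both axes are logarithmic is the slope constant.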
    
